public inbox for stable@vger.kernel.org
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Perry Yuan <perry.yuan@amd.com>,
	Yifan Zhang <yifan1.zhang@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Sasha Levin <sashal@kernel.org>,
	lijo.lazar@amd.com, christian.koenig@amd.com,
	Hawking.Zhang@amd.com, mario.limonciello@amd.com,
	cesun102@amd.com, superm1@kernel.org, kenneth.feng@amd.com,
	kevinyang.wang@amd.com, Yifan.Zha@amd.com, siqueira@igalia.com,
	ilya.zlobintsev@gmail.com, tomasz.pakula.oficjalny@gmail.com,
	Jesse.Zhang@amd.com
Subject: [PATCH AUTOSEL 6.18-6.12] drm/amd/pm: Disable MMIO access during SMU Mode 1 reset
Date: Mon, 12 Jan 2026 09:58:12 -0500
Message-ID: <20260112145840.724774-11-sashal@kernel.org>
In-Reply-To: <20260112145840.724774-1-sashal@kernel.org>

From: Perry Yuan <perry.yuan@amd.com>

[ Upstream commit 0de604d0357d0d22cbf03af1077d174b641707b6 ]

During Mode 1 reset, the ASIC undergoes a reset cycle and becomes
temporarily inaccessible via PCIe. Any attempt to access MMIO registers
during this window (e.g., from interrupt handlers or other driver threads)
can result in uncompleted PCIe transactions, leading to NMI panics or
system hangs.

To prevent this, set the `no_hw_access` flag to true immediately after
triggering the reset. This signals other driver components to skip
register accesses while the device is offline.

A memory barrier `smp_mb()` is added to ensure the flag update is
globally visible to all cores before the driver enters the sleep/wait
state.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 7edb503fe4b6d67f47d8bb0dfafb8e699bb0f8a4)
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Analysis of Commit: drm/amd/pm: Disable MMIO access during SMU Mode 1 reset

### 1. COMMIT MESSAGE ANALYSIS

The commit message clearly describes a **serious stability problem**:
- During Mode 1 reset, the ASIC becomes temporarily inaccessible via
  PCIe
- Any MMIO access during this window (from interrupt handlers or other
  driver threads) can result in **NMI panics or system hangs**
- Keywords: "NMI panics", "system hangs" - these are critical stability
  issues

The commit has proper attribution:
- Reviewed-by tag from Yifan Zhang
- Signed-off-by tag from AMD maintainer Alex Deucher
- Carries a `cherry picked from commit` trailer, so it had already been
  picked between trees before this backport

### 2. CODE CHANGE ANALYSIS

The fix is straightforward and surgical:

**smu_v13_0_0_ppt.c and smu_v14_0_2_ppt.c:**
```c
/* disable mmio access while doing mode 1 reset*/
smu->adev->no_hw_access = true;
/* ensure no_hw_access is globally visible before any MMIO */
smp_mb();
msleep(...);
```
Sets the existing `no_hw_access` flag to block MMIO from other driver
components during the reset sleep window. The `smp_mb()` ensures proper
cross-CPU memory ordering.
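
The ordering described above can be modelled in userspace. This is a
minimal sketch, not driver code: `struct fake_dev` and both functions are
hypothetical names, the flag is a C11 `atomic_bool` rather than the plain
`bool` the driver uses, and `atomic_thread_fence(memory_order_seq_cst)`
stands in as a rough counterpart of `smp_mb()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device; not the real struct amdgpu_device. */
struct fake_dev {
	atomic_bool no_hw_access; /* a plain bool in the driver; atomic here */
};

/* Publish "device is offline" before the caller starts waiting. */
static void fake_block_hw_access(struct fake_dev *dev)
{
	atomic_store_explicit(&dev->no_hw_access, true, memory_order_relaxed);
	/* Full fence: rough userspace counterpart of the kernel's smp_mb(). */
	atomic_thread_fence(memory_order_seq_cst);
	/* The driver would msleep() here while the ASIC resets. */
}

/* What another thread would check before touching a register. */
static bool fake_hw_access_allowed(struct fake_dev *dev)
{
	return !atomic_load_explicit(&dev->no_hw_access, memory_order_seq_cst);
}
```

The point of the fence is only that the store is ordered before whatever
the resetting thread does next; readers still have to check the flag
themselves.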

**amdgpu_device.c:**
```c
/* enable mmio access after mode 1 reset completed */
adev->no_hw_access = false;
```
Re-enables MMIO access after the reset completes successfully.
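
Read together with the SMU hunks, the flag appears to be held only for
the duration of the wait: it is set after the reset message is accepted
and cleared once the reset call returns success. A simplified model of
that sequence, with hypothetical names and a `seen_during_wait` field
added purely for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the device state; not the amdgpu structure. */
struct fake_dev {
	bool no_hw_access;
	bool seen_during_wait; /* what a reader would observe mid-reset */
};

/* SMU-side step: send_result == 0 means the reset message was accepted. */
static int fake_smu_mode1_reset(struct fake_dev *dev, int send_result)
{
	if (!send_result) {
		dev->no_hw_access = true;
		/* Driver: smp_mb(); msleep(...). Here we only sample the flag. */
		dev->seen_during_wait = dev->no_hw_access;
	}
	return send_result;
}

/* Device-level step: clear the flag only when the reset call succeeded. */
static int fake_device_mode1_reset(struct fake_dev *dev, int send_result)
{
	int ret = fake_smu_mode1_reset(dev, send_result);

	if (ret)
		return ret; /* mirrors "goto mode1_reset_failed" */

	dev->no_hw_access = false; /* re-enable access after the reset */
	return 0;
}
```

In this model a failed send never sets the flag, so the early return does
not leave access blocked.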

### 3. CLASSIFICATION

This is a **bug fix** - not a new feature. It uses an existing mechanism
(`no_hw_access` flag) that other driver components already check. The
fix simply sets this flag at the correct time during Mode 1 reset.

### 4. SCOPE AND RISK ASSESSMENT

- **Lines changed**: 16 insertions and 3 deletions across 3 files
- **Complexity**: Very low - just boolean flag setting and memory
  barriers
- **Risk**: LOW - uses existing infrastructure, pattern is
  well-understood
- **Subsystem**: AMD GPU power management (SMU v13.0.0 and v14.0.2)

The changes are minimal and isolated to the Mode 1 reset path.

### 5. USER IMPACT

- **Severity**: CRITICAL - NMI panics and system hangs affect system
  stability
- **Affected users**: AMD GPU users (Radeon RX 7000 series and newer)
  when GPU undergoes error recovery
- **Trigger likelihood**: Mode 1 resets happen during GPU error recovery;
  interrupt handlers racing with the reset is a realistic scenario

### 6. STABILITY INDICATORS

- Has `Reviewed-by` tag
- Upstream commit is 0de604d0357d0d22cbf03af1077d174b641707b6, itself
  cherry-picked from commit 7edb503fe4b6d67f47d8bb0dfafb8e699bb0f8a4
- Comes from AMD engineers with domain expertise

### 7. DEPENDENCY CHECK

The `no_hw_access` field in `amdgpu_device` structure is existing
infrastructure used throughout the AMDGPU driver. This commit doesn't
introduce new fields or dependencies - it only sets an existing flag at
appropriate times.
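
For illustration, a register accessor that honours such a flag might look
like the sketch below. All names are hypothetical, and returning 0 for a
skipped read is an assumption of this sketch rather than a statement
about the driver's own accessors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device model with a tiny pretend register file. */
struct fake_dev {
	bool no_hw_access;
	uint32_t regs[4];
};

/* Read a register, or return 0 without touching it while offline. */
static uint32_t fake_rreg(struct fake_dev *dev, unsigned int idx)
{
	if (dev->no_hw_access)
		return 0; /* skip the access entirely */
	return dev->regs[idx % 4];
}

/* Write a register; the write is dropped while the device is offline. */
static void fake_wreg(struct fake_dev *dev, unsigned int idx, uint32_t val)
{
	if (dev->no_hw_access)
		return;
	dev->regs[idx % 4] = val;
}
```

Callers that cannot tolerate a dropped write or a zero read would need
their own handling; the guard only keeps the access off the bus.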

**Potential concern**: SMU v14 code is relatively new (2024+), so very
old stable trees may not have this code path. The subject line targets
6.12 through 6.18; where the code exists, the fix should apply cleanly.

### RISK VS BENEFIT ASSESSMENT

**Benefits:**
- Prevents NMI panics and system hangs - critical stability fix
- Protects against realistic race conditions during GPU reset

**Risks:**
- Minimal - uses existing flag mechanism
- Small, contained changes with proper memory barriers
- No architectural changes or new features

### CONCLUSION

This commit fixes a critical stability bug (NMI panics, system hangs)
that can occur during normal GPU error recovery scenarios. The fix is:
- Small and surgical (just flag setting + memory barrier)
- Uses existing infrastructure
- Properly reviewed
- Low regression risk
- Already cherry-picked between trees before this backport

The commit meets all stable kernel criteria: it's obviously correct,
fixes a real and serious bug, is small and contained, and introduces no
new features.

**YES**

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c           | 3 +++
 drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c | 7 ++++++-
 drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c | 9 +++++++--
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0cba31ec025c0..e9a14a62e73fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5731,6 +5731,9 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev)
 	if (ret)
 		goto mode1_reset_failed;
 
+	/* enable mmio access after mode 1 reset completed */
+	adev->no_hw_access = false;
+
 	amdgpu_device_load_pci_state(adev->pdev);
 	ret = amdgpu_psp_wait_for_bootloader(adev);
 	if (ret)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
index c1062e5f03936..8d070a9ea2c10 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
@@ -2922,8 +2922,13 @@ static int smu_v13_0_0_mode1_reset(struct smu_context *smu)
 		break;
 	}
 
-	if (!ret)
+	if (!ret) {
+		/* disable mmio access while doing mode 1 reset*/
+		smu->adev->no_hw_access = true;
+		/* ensure no_hw_access is globally visible before any MMIO */
+		smp_mb();
 		msleep(SMU13_MODE1_RESET_WAIT_TIME_IN_MS);
+	}
 
 	return ret;
 }
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c
index 086501cc5213b..2cb2d93f9989a 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0_2_ppt.c
@@ -2142,10 +2142,15 @@ static int smu_v14_0_2_mode1_reset(struct smu_context *smu)
 
 	ret = smu_cmn_send_debug_smc_msg(smu, DEBUGSMC_MSG_Mode1Reset);
 	if (!ret) {
-		if (amdgpu_emu_mode == 1)
+		if (amdgpu_emu_mode == 1) {
 			msleep(50000);
-		else
+		} else {
+			/* disable mmio access while doing mode 1 reset*/
+			smu->adev->no_hw_access = true;
+			/* ensure no_hw_access is globally visible before any MMIO */
+			smp_mb();
 			msleep(1000);
+		}
 	}
 
 	return ret;
-- 
2.51.0

