patches.lists.linux.dev archive mirror
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: "Timur Kristóf" <timur.kristof@gmail.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Sasha Levin" <sashal@kernel.org>,
	lijo.lazar@amd.com, christian.koenig@amd.com,
	Hawking.Zhang@amd.com, mario.limonciello@amd.com,
	alexandre.f.demers@gmail.com, cesun102@amd.com
Subject: [PATCH AUTOSEL 6.17] drm/amd: Disable ASPM on SI
Date: Mon, 27 Oct 2025 20:38:58 -0400
Message-ID: <20251028003940.884625-14-sashal@kernel.org>
In-Reply-To: <20251028003940.884625-1-sashal@kernel.org>

From: Timur Kristóf <timur.kristof@gmail.com>

[ Upstream commit 7bdd91abf0cb3ea78160e2e78fb58b12f6a38d55 ]

Enabling ASPM causes random hangs on Tahiti and Oland on Zen4.
It's unclear if this is a platform-specific or GPU-specific issue.
Disable ASPM on SI for the time being.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Based on my comprehensive semantic analysis of this commit, here is my
assessment:

## **RECOMMENDATION: YES**

This commit should be backported to stable kernel trees.

---

## **DETAILED ANALYSIS**

### **1. Semantic Code Analysis Performed**

**Tools Used:**
- `mcp__semcode__find_function`: Located the modified
  `amdgpu_device_aspm_support_quirk()` function
- `mcp__semcode__find_callers`: Traced the impact scope through the call
  graph
- `mcp__semcode__find_callchain`: Analyzed complete call chains to
  understand initialization paths
- Git history analysis: Examined related ASPM commits and patterns

**Key Findings:**

**Call Graph Analysis:**
```
amdgpu_device_aspm_support_quirk() [MODIFIED]
  ↓ called by
amdgpu_device_should_use_aspm() [1 caller]
  ↓ called by (9 callers across multiple GPU generations)
  ├─ si_program_aspm() [SI generation - directly affected]
  ├─ vi_program_aspm() [VI generation]
  ├─ cik_program_aspm() [CIK generation]
  ├─ nv_program_aspm() [Navi generation]
  ├─ soc15_program_aspm() [SoC15 generation]
  └─ ... and 4 more hardware initialization functions
```

The change adds an early return when `adev->family == AMDGPU_FAMILY_SI`,
which specifically targets Southern Islands GPUs (Tahiti, Oland, Verde,
Pitcairn, Hainan from ~2012).

### **2. Code Changes Analysis**

**Change Size:** Minimal - only 7 lines added (4 comment + 2 code +
1 blank)
- Lines added: `+7`
- Lines removed: `0`
- Files modified: `1`
  (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1882-1888)

**Change Type:** Conservative quirk addition
- Uses existing quirk infrastructure (function already handles Intel
  Alder Lake/Raptor Lake quirks)
- No refactoring or architectural changes
- Simply adds hardware-specific condition at function entry

### **3. Bug Impact Assessment**

**Severity:** **CRITICAL** - Random system hangs
- Symptom: Random hangs on Tahiti and Oland GPUs
- Platform: Zen4 (AMD Ryzen 7000 series)
- Affected Hardware: SI family GPUs (AMDGPU_FAMILY_SI)

**User Exposure:**
From call chain analysis, the code path is triggered during:
- Hardware initialization (`si_common_hw_init` at
  drivers/gpu/drm/amd/amdgpu/si.c:2640)
- Executed automatically when SI GPU is present in system
- No special user action required to trigger the bug

**Impact Scope:**
- **Narrow hardware scope**: Only SI family GPUs (12+ year old hardware,
  but still in use)
- **Platform-specific trigger**: Issues observed on Zen4 platforms
- **Well-contained fix**: Isolated to ASPM quirk handling code

### **4. Backport Suitability - Positive Indicators**

✅ **Fixes Critical Stability Bug**: Random system hangs are severe
issues affecting system usability

✅ **Minimal Change Size**: Only 7 lines added, zero lines removed -
extremely low complexity

✅ **No Dependencies**: Uses existing code infrastructure
(`AMDGPU_FAMILY_SI` constant, quirk pattern)

✅ **Conservative Fix**: Disables problematic feature rather than
attempting complex behavior changes

✅ **Low Regression Risk**:
- Only affects SI generation GPUs
- Disabling ASPM is safe (may slightly increase power consumption but
  prevents hangs)
- No code path changes for other GPU families

✅ **Follows Established Pattern**:
From git history analysis, found similar ASPM quirk commits:
- `c770ef19673fb` - "disable ASPM in some situations"
- `d9b3a066dfcd3` - "Exclude dGPUs in eGPU enclosures from DPM quirks"
- `2757a848cb0f1` - "Explicitly disable ASPM when dynamic switching
  disabled"

✅ **Clear Hardware Scope**: Specifically targets well-defined hardware
(SI family)

✅ **Stable Tree Compliant**:
- Pure bug fix, not a feature addition
- No architectural changes
- Fixes user-visible problem
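
The regression-risk argument above (only SI changes behavior) can be
illustrated with a small standalone model. This is a sketch only: the
`sketch_*` names and the family enum are hypothetical stand-ins, not the
kernel's definitions, and the single `cpu_quirk` flag is an assumed
simplification of the existing x86 host-CPU check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical family list; not the kernel's AMDGPU_FAMILY_* values. */
enum sketch_family {
	SKETCH_SI,
	SKETCH_CIK,
	SKETCH_VI,
	SKETCH_NV,
	SKETCH_FAMILY_COUNT
};

/* Quirk result before the patch: only the host-CPU check exists. */
static bool sketch_quirk_before(enum sketch_family family, bool cpu_quirk)
{
	(void)family;
	return cpu_quirk;
}

/* Quirk result after the patch: SI is always treated as quirked. */
static bool sketch_quirk_after(enum sketch_family family, bool cpu_quirk)
{
	if (family == SKETCH_SI)
		return true;
	return cpu_quirk;
}

/* Counts the families whose quirk result differs between the versions. */
static size_t sketch_count_changed(bool cpu_quirk)
{
	size_t changed = 0;

	for (int f = 0; f < SKETCH_FAMILY_COUNT; f++) {
		enum sketch_family family = (enum sketch_family)f;

		if (sketch_quirk_before(family, cpu_quirk) !=
		    sketch_quirk_after(family, cpu_quirk))
			changed++;
	}
	return changed;
}
```

In this model, exactly one family (SI) changes outcome when the host-CPU
quirk is not active, and none change when it already is, since ASPM was
then disabled for every family anyway.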

### **5. Backport Suitability - Considerations**

⚠️ **No Explicit Cc: stable Tag**: Commit lacks a
`Cc: stable@vger.kernel.org` tag
- This may be intentional (let it bake in mainline first)
- Or could be an oversight given the severity

⚠️ **Root Cause Uncertainty**: The code comment states it is "unclear if
this is a platform-specific or GPU-specific issue"
- However, the conservative fix (disable ASPM) is appropriate regardless
- This uncertainty doesn't affect fix correctness

⚠️ **Affects Legacy Hardware**: SI GPUs from 2012
- But users still run this hardware on modern Zen4 platforms
- Hangs make systems unusable, so fix is valuable

### **6. Technical Correctness**

**Function Behavior:**
- `amdgpu_device_aspm_support_quirk()` returns `true` when ASPM should
  be **disabled** (quirk detected)
- Adding SI family check causes function to return `true` early
- This propagates through `amdgpu_device_should_use_aspm()` which
  returns `false` when quirk detected
- Result: `si_program_aspm()` exits early without programming ASPM
  registers
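
The propagation described above can be sketched as a standalone model.
The `model_*` names, the struct fields, and the enum are hypothetical and
exist only for illustration. The real `amdgpu_device_should_use_aspm()`
performs further checks that are omitted here, and the register write is
reduced to a counter.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel types; values are illustrative. */
enum model_family { MODEL_FAMILY_SI, MODEL_FAMILY_CIK, MODEL_FAMILY_VI };

struct model_adev {
	enum model_family family;
	bool host_cpu_quirk;    /* stands in for the x86 CPU-model check */
	bool pcie_aspm_enabled; /* stands in for the PCI core's ASPM state */
	int aspm_reg_writes;    /* counts simulated register programming */
};

/* Returns true when a quirk applies, i.e. ASPM must NOT be used. */
static bool model_aspm_support_quirk(const struct model_adev *adev)
{
	if (adev->family == MODEL_FAMILY_SI)
		return true; /* models the check this patch adds */
	return adev->host_cpu_quirk;
}

/* Inverts the quirk result: a detected quirk means "do not use ASPM". */
static bool model_should_use_aspm(const struct model_adev *adev)
{
	if (model_aspm_support_quirk(adev))
		return false;
	return adev->pcie_aspm_enabled;
}

/* Models a *_program_aspm() hook that returns before touching registers. */
static void model_program_aspm(struct model_adev *adev)
{
	if (!model_should_use_aspm(adev))
		return;
	adev->aspm_reg_writes++;
}
```

With `pcie_aspm_enabled` set and no host-CPU quirk, calling
`model_program_aspm()` leaves `aspm_reg_writes` at 0 for an SI device and
increments it for a VI device, which mirrors the early exit described for
`si_program_aspm()`.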

**Semantic Verification:**
Checked the function at commit `7a899fb4de29c` (before this change) - it
only handled Intel Alder Lake/Raptor Lake quirks. The new SI quirk
follows the same early-return pattern.

### **7. Historical Context**

From git log analysis:
- First appeared in v6.18-rc2 (October 2025)
- Part of ongoing ASPM quirk management in AMD GPU driver
- Follows 2+ year history of ASPM-related fixes in this subsystem
- Reviewed by Alex Deucher (AMD maintainer)

### **8. Final Assessment**

**BACKPORT: YES**

**Justification:**
1. **Critical stability fix** resolving random system hangs
2. **Minimal risk** due to small, contained change
3. **Clear benefit** for users with SI GPUs on Zen4 platforms
4. **Proper technical approach** using established quirk infrastructure
5. **No side effects** on other hardware generations
6. **Conservative solution** that safely disables problematic feature

**Recommendation:** This commit should be backported to all active
stable kernel trees that support SI generation AMD GPUs (likely 5.x and
6.x stable series). The fix is low-risk and provides significant
stability improvements for affected users.

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c8459337fcb89..e01264d644701 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1879,6 +1879,13 @@ static bool amdgpu_device_pcie_dynamic_switching_supported(struct amdgpu_device
 
 static bool amdgpu_device_aspm_support_quirk(struct amdgpu_device *adev)
 {
+	/* Enabling ASPM causes randoms hangs on Tahiti and Oland on Zen4.
+	 * It's unclear if this is a platform-specific or GPU-specific issue.
+	 * Disable ASPM on SI for the time being.
+	 */
+	if (adev->family == AMDGPU_FAMILY_SI)
+		return true;
+
 #if IS_ENABLED(CONFIG_X86)
 	struct cpuinfo_x86 *c = &cpu_data(0);
 
-- 
2.51.0


