patches.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: "Jesse.Zhang" <Jesse.Zhang@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Lijo Lazar" <lijo.lazar@amd.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Sasha Levin" <sashal@kernel.org>,
	srinivasan.shanmugam@amd.com, sunil.khatri@amd.com,
	Arunpravin.PaneerSelvam@amd.com, Tong.Liu01@amd.com,
	tvrtko.ursulin@igalia.com, alexandre.f.demers@gmail.com,
	mario.limonciello@amd.com, Prike.Liang@amd.com,
	shashank.sharma@amd.com, vitaly.prosyak@amd.com,
	Victor.Skvortsov@amd.com, Hawking.Zhang@amd.com,
	Shravankumar.Gande@amd.com, mtodorovac69@gmail.com,
	xiang.liu@amd.com, shaoyun.liu@amd.com, Tony.Yi@amd.com
Subject: [PATCH AUTOSEL 6.17-6.1] drm/amdgpu: Fix NULL pointer dereference in VRAM logic for APU devices
Date: Mon, 27 Oct 2025 20:39:04 -0400	[thread overview]
Message-ID: <20251028003940.884625-20-sashal@kernel.org> (raw)
In-Reply-To: <20251028003940.884625-1-sashal@kernel.org>

From: "Jesse.Zhang" <Jesse.Zhang@amd.com>

[ Upstream commit 883f309add55060233bf11c1ea6947140372920f ]

Previously, APU platforms (and other scenarios with uninitialized VRAM managers)
triggered a NULL pointer dereference in `ttm_resource_manager_usage()`. The root
cause is not that the `struct ttm_resource_manager *man` pointer itself is NULL,
but that `man->bdev` (the backing device pointer within the manager) remains
uninitialized (NULL) on APUs—since APUs lack dedicated VRAM and do not fully
set up VRAM manager structures. When `ttm_resource_manager_usage()` attempts to
acquire `man->bdev->lru_lock`, it dereferences the NULL `man->bdev`, leading to
a kernel OOPS.

1. **amdgpu_cs.c**: Extend the existing bandwidth control check in
   `amdgpu_cs_get_threshold_for_moves()` to include a check for
   `ttm_resource_manager_used()`. If the manager is not used (uninitialized
   `bdev`), return 0 for migration thresholds immediately—skipping VRAM-specific
   logic that would trigger the NULL dereference.

2. **amdgpu_kms.c**: Update the `AMDGPU_INFO_VRAM_USAGE` ioctl and memory info
   reporting to use a conditional: if the manager is used, return the real VRAM
   usage; otherwise, return 0. This avoids accessing `man->bdev` when it is
   NULL.

3. **amdgpu_virt.c**: Modify the vf2pf (virtual function to physical function)
   data write path. Use `ttm_resource_manager_used()` to check validity: if the
   manager is usable, calculate `fb_usage` from VRAM usage; otherwise, set
   `fb_usage` to 0 (APUs have no discrete framebuffer to report).

This approach is more robust than APU-specific checks because it:
- Works for all scenarios where the VRAM manager is uninitialized (not just APUs),
- Aligns with TTM's design by using its native helper function,
- Preserves correct behavior for discrete GPUs (which have fully initialized
  `man->bdev` and pass the `ttm_resource_manager_used()` check).

v4: use ttm_resource_manager_used(&adev->mman.vram_mgr.manager) instead of checking the adev->gmc.is_app_apu flag (Christian)

Reviewed-by: Christian König <christian.koenig@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## BACKPORT RECOMMENDATION: **YES**

### Analysis Summary

I performed comprehensive semantic code analysis using the available MCP
tools to evaluate this NULL pointer dereference fix for APU devices.

---

### 1. SEMANTIC TOOLS USED

**Function Analysis:**
- `mcp__semcode__find_function`: Located all 3 affected functions and
  `ttm_resource_manager_usage()`
- `mcp__semcode__find_callers`: Traced call graphs to determine user-
  space exposure
- `mcp__semcode__find_type`: Examined `ttm_resource_manager` structure
  to understand the root cause

**Call Graph Analysis Results:**
- `amdgpu_info_ioctl`: **0 callers** (it's a top-level ioctl handler) →
  **DIRECTLY USER-SPACE EXPOSED**
  (drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:613)
- `amdgpu_cs_get_threshold_for_moves`: Called by `amdgpu_cs_parser_bos`
  → called by `amdgpu_cs_ioctl` → **USER-SPACE EXPOSED** via command
  submission ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:702)
- `amdgpu_virt_write_vf2pf_data`: Called by SRIOV virtualization code →
  potentially **USER-SPACE TRIGGERABLE** in virtualized environments
  (drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c:576)
- `ttm_resource_manager_usage`: **18 callers across multiple drivers**
  (amdgpu, radeon, nouveau, xe)

---

### 2. ROOT CAUSE ANALYSIS

The bug occurs in `ttm_resource_manager_usage()` at
drivers/gpu/drm/ttm/ttm_resource.c:586-594:

```c
uint64_t ttm_resource_manager_usage(struct ttm_resource_manager *man)
{
    uint64_t usage;
    spin_lock(&man->bdev->lru_lock);  // ← NULL DEREFERENCE HERE
    usage = man->usage;
    spin_unlock(&man->bdev->lru_lock);
    return usage;
}
```

**Why it happens:** On APU devices, the VRAM manager structure exists
but `man->bdev` (backing device pointer) is **NULL** because APUs don't
have dedicated VRAM and don't fully initialize VRAM manager structures.
The `ttm_resource_manager_used()` check returns false when
`man->use_type` is false, indicating the manager is not actually in use.

---

### 3. USER-SPACE EXPOSURE & IMPACT SCOPE

**CRITICAL FINDING:** All three affected code paths are user-space
triggerable:

1. **amdgpu_kms.c:760** (`AMDGPU_INFO_VRAM_USAGE` ioctl case):
   - Any userspace program can call this ioctl to query VRAM usage
   - On APUs, this triggers NULL deref → **KERNEL CRASH**

2. **amdgpu_cs.c:711** (command submission path):
   - Called during GPU command buffer submission
   - Normal GPU applications (games, compute workloads) trigger this
   - On APUs, attempting to use GPU triggers NULL deref → **KERNEL
     CRASH**

3. **amdgpu_virt.c:601** (SRIOV path):
   - Affects virtualized APU environments
   - Less common but still user-triggerable

**Affected Platforms:** All AMD APU devices (Ryzen with integrated
graphics, etc.) - **widely deployed hardware**

---

### 4. FIX COMPLEXITY & DEPENDENCIES

**Fix Complexity:** **VERY SIMPLE**
- Only adds conditional checks:
  `ttm_resource_manager_used(&adev->mman.vram_mgr.manager) ? ... : 0`
- No behavioral changes for discrete GPUs
- No new functions or data structures
- Changes span only 3 files, 3 locations

**Dependency Analysis:**
```c
static inline bool ttm_resource_manager_used(struct ttm_resource_manager
*man)
{
    return man->use_type;
}
```
This function has existed since **August 2020** (commit b2458726b38cb)
when TTM resource management was refactored. It's available in all
stable kernels that would be backport candidates.

---

### 5. SEMANTIC CHANGE ASSESSMENT

**Code Changes Analysis:**

1. **amdgpu_cs.c:711** - Extends existing early-return check:
  ```c
   - if (!adev->mm_stats.log2_max_MBps) {
   + if ((!adev->mm_stats.log2_max_MBps) ||
!ttm_resource_manager_used(&adev->mman.vram_mgr.manager)) {
   ```
   **Effect:** Returns 0 for migration thresholds on APUs (correct
behavior - no VRAM to migrate)

2. **amdgpu_kms.c:760 & 807** - Conditional usage query:
  ```c
   - ui64 = ttm_resource_manager_usage(&adev->mman.vram_mgr.manager);
   + ui64 = ttm_resource_manager_used(&adev->mman.vram_mgr.manager) ?
   +     ttm_resource_manager_usage(&adev->mman.vram_mgr.manager) : 0;
   ```
   **Effect:** Reports 0 VRAM usage for APUs (correct - APUs have no
dedicated VRAM)

3. **amdgpu_virt.c:601** - Similar conditional for fb_usage reporting

**Side Effects:** NONE for discrete GPUs. APUs now correctly report 0
instead of crashing.

---

### 6. BACKPORT SUITABILITY CRITERIA

✅ **Fixes important bug:** NULL pointer dereference → kernel crash
✅ **Affects users:** All AMD APU users (widespread hardware)
✅ **User-space triggerable:** Via standard ioctl calls
✅ **Small and contained:** 3 files, simple conditional checks
✅ **No architectural changes:** Uses existing TTM infrastructure
✅ **Low regression risk:** Only affects APUs, changes are defensive
✅ **No new features:** Pure bug fix
✅ **Dependencies satisfied:** Helper function available since 2020
✅ **Confined to subsystem:** Only affects amdgpu driver

---

### 7. STABLE TREE COMPLIANCE

**Bug Fix:** ✅ YES - Fixes NULL pointer dereference
**New Feature:** ❌ NO - Only defensive checks
**Security Fix:** ⚠️ POTENTIAL - DoS via user-triggerable kernel crash
**Performance Optimization:** ❌ NO
**Architectural Change:** ❌ NO

---

### 8. MISSING TAGS OBSERVATION

**Notable:** This commit lacks standard stable tree indicators:
- No `Fixes:` tag (should reference when the bug was introduced)
- No `Cc: stable@vger.kernel.org` tag

However, the commit message clearly describes a NULL pointer dereference
fix, which is a critical bug that warrants backporting regardless of
missing tags.

---

### FINAL RECOMMENDATION

**BACKPORT: YES** - This is a **HIGH PRIORITY** backport candidate
because:

1. **Severity:** Kernel crash (NULL pointer dereference) affecting all
   APU users
2. **Exposure:** User-space triggerable via common ioctl paths
3. **Risk:** Low - simple defensive checks with no behavioral changes
   for working systems
4. **Scope:** Affects widely-deployed AMD APU hardware
5. **Fix Quality:** Clean, minimal, uses existing TTM infrastructure

The fix should be backported to all active stable kernel trees
supporting AMD APU devices (likely 5.10+).

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  | 7 ++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 4 ++--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index d3f220be2ef9a..2a142e9e97384 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -714,7 +714,7 @@ static void amdgpu_cs_get_threshold_for_moves(struct amdgpu_device *adev,
 	 */
 	const s64 us_upper_bound = 200000;
 
-	if (!adev->mm_stats.log2_max_MBps) {
+	if ((!adev->mm_stats.log2_max_MBps) || !ttm_resource_manager_used(&adev->mman.vram_mgr.manager)) {
 		*max_bytes = 0;
 		*max_vis_bytes = 0;
 		return;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 8a76960803c65..8162f7f625a86 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -758,7 +758,8 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
 		ui64 = atomic64_read(&adev->num_vram_cpu_page_faults);
 		return copy_to_user(out, &ui64, min(size, 8u)) ? -EFAULT : 0;
 	case AMDGPU_INFO_VRAM_USAGE:
-		ui64 = ttm_resource_manager_usage(&adev->mman.vram_mgr.manager);
+		ui64 = ttm_resource_manager_used(&adev->mman.vram_mgr.manager) ?
+			ttm_resource_manager_usage(&adev->mman.vram_mgr.manager) : 0;
 		return copy_to_user(out, &ui64, min(size, 8u)) ? -EFAULT : 0;
 	case AMDGPU_INFO_VIS_VRAM_USAGE:
 		ui64 = amdgpu_vram_mgr_vis_usage(&adev->mman.vram_mgr);
@@ -804,8 +805,8 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
 		mem.vram.usable_heap_size = adev->gmc.real_vram_size -
 			atomic64_read(&adev->vram_pin_size) -
 			AMDGPU_VM_RESERVED_VRAM;
-		mem.vram.heap_usage =
-			ttm_resource_manager_usage(vram_man);
+		mem.vram.heap_usage = ttm_resource_manager_used(&adev->mman.vram_mgr.manager) ?
+				ttm_resource_manager_usage(vram_man) : 0;
 		mem.vram.max_allocation = mem.vram.usable_heap_size * 3 / 4;
 
 		mem.cpu_accessible_vram.total_heap_size =
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 13f0cdeb59c46..e13bf2345ef5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -598,8 +598,8 @@ static int amdgpu_virt_write_vf2pf_data(struct amdgpu_device *adev)
 	vf2pf_info->driver_cert = 0;
 	vf2pf_info->os_info.all = 0;
 
-	vf2pf_info->fb_usage =
-		ttm_resource_manager_usage(&adev->mman.vram_mgr.manager) >> 20;
+	vf2pf_info->fb_usage = ttm_resource_manager_used(&adev->mman.vram_mgr.manager) ?
+		 ttm_resource_manager_usage(&adev->mman.vram_mgr.manager) >> 20 : 0;
 	vf2pf_info->fb_vis_usage =
 		amdgpu_vram_mgr_vis_usage(&adev->mman.vram_mgr) >> 20;
 	vf2pf_info->fb_size = adev->gmc.real_vram_size >> 20;
-- 
2.51.0


  parent reply	other threads:[~2025-10-28  0:40 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-28  0:38 [PATCH AUTOSEL 6.17-6.1] smb/server: fix possible memory leak in smb2_read() Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-5.4] NFS4: Fix state renewals missing after boot Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-6.12] drm/amdgpu: remove two invalid BUG_ON()s Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-5.15] NFS: check if suid/sgid was cleared after a write as needed Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-6.12] HID: logitech-hidpp: Add HIDPP_QUIRK_RESET_HI_RES_SCROLL Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-5.4] ASoC: max98090/91: fixed max98091 ALSA widget powering up/down Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17] ALSA: hda/realtek: Fix mute led for HP Omen 17-cb0xxx Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-5.10] RISC-V: clear hot-unplugged cores from all task mm_cpumasks to avoid rfence errors Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17] ASoC: nau8821: Avoid unnecessary blocking in IRQ handler Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-5.4] HID: quirks: avoid Cooler Master MM712 dongle wakeup bug Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17] drm/amdkfd: fix suspend/resume all calls in mes based eviction path Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-6.12] exfat: fix improper check of dentry.stream.valid_size Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17] io_uring: fix unexpected placement on same size resizing Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17] drm/amd: Disable ASPM on SI Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-6.6] riscv: acpi: avoid errors caused by probing DT devices when ACPI is used Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.1] drm/amd/pm: Disable MCLK switching on SI at high pixel clocks Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.12] drm/amdgpu: hide VRAM sysfs attributes on GPUs without VRAM Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17] fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.12] NFS4: Apply delay_retrans to async operations Sasha Levin
2025-10-28  0:39 ` Sasha Levin [this message]
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17] ixgbe: handle IXGBE_VF_FEATURES_NEGOTIATE mbox cmd Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17] ixgbe: handle IXGBE_VF_GET_PF_LINK_STATE mailbox operation Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.6] HID: quirks: Add ALWAYS_POLL quirk for VRS R295 steering wheel Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17] HID: intel-thc-hid: intel-quickspi: Add ARL PCI Device Id's Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.12] HID: nintendo: Wait longer for initial probe Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.1] smb/server: fix possible refcount leak in smb2_sess_setup() Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251028003940.884625-20-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=Arunpravin.PaneerSelvam@amd.com \
    --cc=Hawking.Zhang@amd.com \
    --cc=Jesse.Zhang@amd.com \
    --cc=Prike.Liang@amd.com \
    --cc=Shravankumar.Gande@amd.com \
    --cc=Tong.Liu01@amd.com \
    --cc=Tony.Yi@amd.com \
    --cc=Victor.Skvortsov@amd.com \
    --cc=alexander.deucher@amd.com \
    --cc=alexandre.f.demers@gmail.com \
    --cc=christian.koenig@amd.com \
    --cc=lijo.lazar@amd.com \
    --cc=mario.limonciello@amd.com \
    --cc=mtodorovac69@gmail.com \
    --cc=patches@lists.linux.dev \
    --cc=shaoyun.liu@amd.com \
    --cc=shashank.sharma@amd.com \
    --cc=srinivasan.shanmugam@amd.com \
    --cc=stable@vger.kernel.org \
    --cc=sunil.khatri@amd.com \
    --cc=tvrtko.ursulin@igalia.com \
    --cc=vitaly.prosyak@amd.com \
    --cc=xiang.liu@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).