[PATCH AUTOSEL 7.0] drm/amd/pm: Avoid overflow when sorting pp_feature list

From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Asad Kamal <asad.kamal@amd.com>, Lijo Lazar <lijo.lazar@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Sasha Levin <sashal@kernel.org>,
	kenneth.feng@amd.com, christian.koenig@amd.com,
	airlied@gmail.com, simona@ffwll.ch,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0] drm/amd/pm: Avoid overflow when sorting pp_feature list
Date: Mon, 20 Apr 2026 09:20:41 -0400
Message-ID: <20260420132314.1023554-247-sashal@kernel.org>
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Asad Kamal <asad.kamal@amd.com>

[ Upstream commit 8e8f6bda8a84f41c4002bca44ac89a5b3f8c7df2 ]

pp_features sorting uses int8_t sort_feature[] to store driver
feature enum indices. On newer ASICs the enum index can exceed 127,
causing signed overflow and silently dropping entries from the output.
Switch the array to int16_t so all enum indices are preserved.

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed for a complete analysis. Let me
compile the full walkthrough.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

### Step 1.1: Subject Line
- **Subsystem:** `drm/amd/pm` (AMD GPU power management)
- **Action verb:** "Avoid" (preventing a bug)
- **Summary:** Avoid integer overflow in `sort_feature[]` array when
  enum indices exceed `int8_t` max (127)

### Step 1.2: Tags
- `Signed-off-by: Asad Kamal <asad.kamal@amd.com>` - AMD employee,
  author
- `Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>` - AMD GPU PM subsystem
  reviewer
- `Signed-off-by: Alex Deucher <alexander.deucher@amd.com>` - AMD DRM
  maintainer
- No Fixes: tag, no Cc: stable, no Reported-by (expected for autosel
  candidates)

### Step 1.3: Commit Body
Bug: `int8_t sort_feature[]` stores enum indices that can exceed 127 on
newer ASICs. Signed overflow wraps values to negative, and the
subsequent `< 0` check silently drops those entries from the sysfs
output. Fix: widen to `int16_t`.

### Step 1.4: Hidden Bug Fix Detection
This is explicitly described as an overflow fix. Not hidden.

## PHASE 2: DIFF ANALYSIS

### Step 2.1: Inventory
- 1 file changed: `drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c`
- 1 line added, 1 line removed
- Function modified: `smu_cmn_get_pp_feature_mask()`

### Step 2.2: Code Flow
**Before:** `int8_t sort_feature[MAX(SMU_FEATURE_COUNT,
SMU_FEATURE_MAX)]` - can hold values -128 to 127.
**After:** `int16_t sort_feature[MAX(SMU_FEATURE_COUNT,
SMU_FEATURE_MAX)]` - can hold values -32768 to 32767.

The array is initialized to `-1` via `memset(sort_feature, -1,
sizeof(...))`, then populated with enum index `i` (0 to
`SMU_FEATURE_COUNT-1`). Entries remaining `-1` are skipped via the `< 0`
check. With `int8_t`, any `i >= 128` overflows to a negative value,
falsely triggering the skip.
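
The wrap-and-skip behaviour described above can be reproduced outside the kernel. The following is only a minimal sketch, not the driver code: `DEMO_FEATURE_COUNT` and the `demo_*` helpers are hypothetical names, the value 135 is taken from this analysis, and the slot index is assumed to equal the enum index (the real function indexes by the mapped feature index). It also assumes a compiler that wraps on narrowing conversion, as GCC and Clang do.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SMU_FEATURE_COUNT; this analysis reports 135. */
#define DEMO_FEATURE_COUNT 135

/*
 * Store enum index i in slot i of a -1-initialised int8_t array, then
 * count how many slots a "< 0" check would skip.
 */
static int demo_count_dropped_int8(void)
{
	int8_t sort_feature[DEMO_FEATURE_COUNT];
	int i, dropped = 0;

	memset(sort_feature, -1, sizeof(sort_feature));
	for (i = 0; i < DEMO_FEATURE_COUNT; i++)
		sort_feature[i] = (int8_t)i;	/* i >= 128 wraps to a negative value */

	for (i = 0; i < DEMO_FEATURE_COUNT; i++)
		if (sort_feature[i] < 0)
			dropped++;

	return dropped;
}

/* Same loop with the widened element type used by the fix. */
static int demo_count_dropped_int16(void)
{
	int16_t sort_feature[DEMO_FEATURE_COUNT];
	int i, dropped = 0;

	memset(sort_feature, -1, sizeof(sort_feature));
	for (i = 0; i < DEMO_FEATURE_COUNT; i++)
		sort_feature[i] = (int16_t)i;	/* all indices fit without wrapping */

	for (i = 0; i < DEMO_FEATURE_COUNT; i++)
		if (sort_feature[i] < 0)
			dropped++;

	return dropped;
}
```

Under these assumptions `demo_count_dropped_int8()` returns 7 (indices 128 through 134) and `demo_count_dropped_int16()` returns 0.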

### Step 2.3: Bug Mechanism
**Integer overflow / type bug.** `SMU_FEATURE_COUNT = 135` (verified by
counting enum entries). Indices 128-134 (7 features: `APT_SQ_THROTTLE`,
`APT_PF_DCS`, `GFX_EDC_XVMIN`, `GFX_DIDT_XVMIN`, `FAN_ABNORMAL`, `PIT`,
`HROM_EN`) overflow `int8_t`, wrapping to negative values and being
silently dropped.

### Step 2.4: Fix Quality
- Obviously correct: widening the type eliminates the overflow
- `memset(-1)` still works correctly: fills all bytes with `0xFF`,
  making each `int16_t` element `0xFFFF = -1` in two's complement
  (confirmed by the author in review discussion and correct by C
  standard)
- No regression risk: the type widening is strictly safe; no logic
  changes
- Minimal and surgical: 1-line change
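
The `memset(-1)` point can be checked with a small user-space sketch. The `demo_*` helpers below are hypothetical and exist only for illustration. They rely on `int16_t` being a two's complement type with no padding bits, which the C standard requires of the exact-width types.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill a small int16_t array byte-wise with c and return its first
 * element. memset() converts c to unsigned char, so -1 becomes 0xFF.
 */
static int16_t demo_first_after_memset(int c)
{
	int16_t arr[4];

	memset(arr, c, sizeof(arr));
	return arr[0];
}

/* Return 1 if every element reads back as -1 after memset(-1). */
static int demo_memset_all_minus_one(int16_t *arr, size_t n)
{
	size_t i;

	memset(arr, -1, n * sizeof(*arr));
	for (i = 0; i < n; i++)
		if (arr[i] != -1)
			return 0;
	return 1;
}
```

`demo_first_after_memset(-1)` yields -1 because both bytes are `0xFF`. By contrast, `demo_first_after_memset(1)` yields 257 (`0x0101`), which shows that the byte-wise fill only produces the intended element value for patterns such as 0 and -1.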

## PHASE 3: GIT HISTORY

### Step 3.1: Blame
The `int8_t` type was introduced in commit `6f73d6762694c` ("drm/amd/pm:
optimize the interface for dpm feature status query", dated 2022-05-25,
by Evan Quan). Originally (commit `7dbf78051f75f1`, 2020), the array was
`uint32_t sort_feature[SMU_FEATURE_COUNT]` with no overflow possibility.
The refactoring in 6f73d6762694c downsized the type to `int8_t` (using
`-1` as sentinel).

### Step 3.2: Fixes: tag
No Fixes: tag present. The logical "Fixes:" would be `6f73d6762694c`
(introduced `int8_t`) + `25d48f2eb0af1` (pushed enum count past 127).

### Step 3.3: Related Changes
Recent changes to `smu_cmn.c` include significant refactoring of feature
mask handling (`7b88453a476c9` etc.), but none address this specific
overflow.

### Step 3.4: Author
Asad Kamal is an AMD employee who regularly contributes to `drm/amd/pm`.
Multiple recent commits in the subsystem.

### Step 3.5: Dependencies
No dependencies. The fix is self-contained.

## PHASE 4: MAILING LIST DISCUSSION

### Step 4.1: Original Submission
Found via `b4 dig -c 8e8f6bda8a84f`:
https://patch.msgid.link/20260302061242.3062232-1-asad.kamal@amd.com

### Step 4.2: Review Discussion
- **Lijo Lazar** (AMD reviewer): Gave `Reviewed-by` immediately
- **Kevin Wang** raised a concern about `memset(-1)` correctness with
  `int16_t` — asking whether it would correctly initialize all elements
  to `-1`
- **Asad Kamal** correctly explained: "memset fills all bytes with 0xFF.
  For int16_t, that becomes 0xFFFF, which is -1 in two's complement."
- **Kevin Wang** accepted the explanation: "Based on private
  discussions, please continue to submit the code."
- No NAKs, no concerns about the fix itself, only a clarification
  question that was satisfactorily resolved.

### Step 4.3-4.5: Bug Reports and Stable Discussion
No external bug reports. No stable-specific discussion found.

## PHASE 5: CODE SEMANTIC ANALYSIS

### Step 5.1-5.2: Call Chain
`smu_cmn_get_pp_feature_mask()` is called via sysfs: user reads
`pp_features` -> `amdgpu_pm.c:amdgpu_dpm_get_ppfeature_status()` ->
`smu_sys_get_pp_feature_mask()` -> `smu_cmn_get_pp_feature_mask()`. Used
by **17 different GPU backends** (verified: SMU v11, v12, v13, v14, v15
variants).

### Step 5.4: User Reachability
Directly reachable from userspace via sysfs read. Any user or monitoring
tool reading GPU feature status triggers this code.
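
One way to observe the effect from userspace is to count the lines the node returns before and after the fix. This is a rough sketch only: the path in `DEMO_PP_FEATURES_PATH` is an assumption (the card index varies between systems), the `demo_*` helpers are hypothetical, and nothing here depends on the exact format of the `pp_features` output.

```c
#include <stdio.h>

/* Assumed location of the node; the card index differs between systems. */
#define DEMO_PP_FEATURES_PATH "/sys/class/drm/card0/device/pp_features"

/* Count newline-terminated lines in an already-open stream. */
static long demo_count_lines(FILE *fp)
{
	long lines = 0;
	int c;

	while ((c = fgetc(fp)) != EOF)
		if (c == '\n')
			lines++;

	return lines;
}

/* Open the given path and return its line count, or -1 if it cannot be opened. */
static long demo_count_pp_feature_lines(const char *path)
{
	FILE *fp = fopen(path, "r");
	long lines;

	if (!fp)
		return -1;

	lines = demo_count_lines(fp);
	fclose(fp);
	return lines;
}
```

Calling `demo_count_pp_feature_lines(DEMO_PP_FEATURES_PATH)` on an affected kernel and again on a fixed one should show a higher count after the fix if the ASIC maps any feature with an enum index above 127.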

## PHASE 6: STABLE TREE ANALYSIS

### Step 6.1: Buggy Code in Stable Trees
- `int8_t` introduced in `6f73d6762694c` (v5.19/v6.0 era)
- Overflow-triggering features added in `25d48f2eb0af1` (v6.12)
- **The overflow is triggerable in v6.12+ stable trees** where both the
  `int8_t` type and the >127 enum count coexist
- For older stable trees (6.6.y, 6.1.y), SMU_FEATURE_COUNT is still <
  128, so no overflow yet — but future backported features could trigger
  it

### Step 6.2: Backport Difficulty
Clean apply expected — the change is a single-line type change with no
context dependencies.

## PHASE 7: SUBSYSTEM CONTEXT

### Step 7.1: Subsystem
- **drm/amd/pm** — AMD GPU power management, IMPORTANT criticality
- Affects all users with AMD GPUs using swSMU (modern AMD GPUs: RDNA2+,
  CDNA)

### Step 7.2: Activity
Very actively developed — many recent commits.

## PHASE 8: IMPACT AND RISK ASSESSMENT

### Step 8.1: Affected Users
All users with AMD GPUs running SMU v14.0.2/3 or newer (Radeon RX 9000
series and similar), or any ASIC whose feature mapping exceeds index
127.

### Step 8.2: Trigger Conditions
- **Trigger:** Any read of `pp_features` sysfs node
- **Frequency:** Common — monitoring tools, manual inspection, power
  management tools read this
- **Unprivileged:** Yes, sysfs readable by any user

### Step 8.3: Severity
- **MEDIUM:** Incorrect/incomplete sysfs output. Not a crash or security
  issue, but features are silently dropped, making power management
  monitoring unreliable.

### Step 8.4: Risk-Benefit
- **Benefit:** Fixes incorrect sysfs output for AMD GPU users; prevents
  silent data loss in feature reporting
- **Risk:** Extremely low — 1-line type change, no logic modification,
  correctness of `memset(-1)` with `int16_t` verified in review and
  mathematically sound
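- **Stack increase:** `sort_feature` doubles in size, from one byte to
  two bytes per element. With `MAX(SMU_FEATURE_COUNT, SMU_FEATURE_MAX)`
  at least 135 elements, the array grows by at least 135 bytes (for
  example from 135 to 270); negligible for a stack-allocated array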

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: Evidence Summary
**FOR backporting:**
- Fixes a real, verifiable integer overflow bug
- SMU_FEATURE_COUNT = 135 > 127, confirmed to overflow `int8_t`
- 7 power management features silently dropped from sysfs output
- 1-line fix, obviously correct, minimal risk
- Reviewed by AMD engineer, no objections after clarification
- Used by 17 GPU backends across all modern AMD GPUs
- Signed off by Alex Deucher (AMD DRM maintainer)

**AGAINST backporting:**
- Not a crash or security issue (incorrect output only)
- Only affects v6.12+ trees where enum count exceeds 127
- No Fixes: tag or Cc: stable

### Step 9.2: Stable Rules Checklist
1. **Obviously correct and tested?** YES — trivial type widening,
   reviewed
2. **Fixes a real bug?** YES — integer overflow causing features to be
   silently dropped
3. **Important issue?** MEDIUM — not crash/security, but correctness bug
   in user-visible output
4. **Small and contained?** YES — 1 line, 1 file
5. **No new features or APIs?** Correct — no new features
6. **Can apply to stable?** YES — clean apply expected

### Step 9.3: Exception Categories
None applicable — this is a standard bug fix.

### Step 9.4: Decision
This is a clean, minimal, well-reviewed bug fix for a verifiable integer
overflow that causes incorrect user-visible behavior on modern AMD GPUs.
It meets all stable criteria.

## Verification

- [Phase 1] Parsed tags: Reviewed-by Lijo Lazar, SOB from Alex Deucher
  (AMD maintainer)
- [Phase 2] Diff analysis: 1 line changed, `int8_t` -> `int16_t` in
  `smu_cmn_get_pp_feature_mask()`
- [Phase 2] Verified SMU_FEATURE_COUNT = 135 by counting enum entries in
  `smu_types.h` — 7 features exceed index 127
- [Phase 3] git blame: `int8_t` introduced in `6f73d6762694c`
  (v5.19/v6.0 era, Evan Quan, 2022)
- [Phase 3] Overflow-triggering features added in `25d48f2eb0af1`
  (v6.12, 2024-09-10)
- [Phase 3] Original type was `uint32_t` in `7dbf78051f75f1` (2020) — no
  overflow possible
- [Phase 4] b4 dig -c found submission:
  https://patch.msgid.link/20260302061242.3062232-1-asad.kamal@amd.com
- [Phase 4] b4 dig -w: AMD team members CC'd (lijo.lazar, hawking.zhang,
  le.ma, alexander.deucher, kevinyang.wang)
- [Phase 4] Review discussion decoded from base64: Kevin Wang raised
  memset concern, Asad explained correctly, Kevin approved
- [Phase 5] Traced call chain: sysfs read ->
  `amdgpu_dpm_get_ppfeature_status()` -> `smu_sys_get_pp_feature_mask()`
  -> target function
- [Phase 5] Verified 17 GPU backends use this function (SMU v11, v12,
  v13, v14, v15)
- [Phase 6] Bug triggerable in v6.12+ (both int8_t type and >127 enum
  present)
- [Phase 8] Severity: MEDIUM (incorrect sysfs output, not
  crash/security)
- [Phase 8] Risk: Very low (1-line type change, no logic change, stack
  array doubles in size, growing by at least 135 bytes)

**YES**

 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 6fd50c2fd20e0..97ed66cb47472 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -880,7 +880,7 @@ static const char *smu_get_feature_name(struct smu_context *smu,
 size_t smu_cmn_get_pp_feature_mask(struct smu_context *smu,
 				   char *buf)
 {
-	int8_t sort_feature[MAX(SMU_FEATURE_COUNT, SMU_FEATURE_MAX)];
+	int16_t sort_feature[MAX(SMU_FEATURE_COUNT, SMU_FEATURE_MAX)];
 	struct smu_feature_bits feature_mask;
 	uint32_t features[2];
 	int i, feature_index;
-- 
2.53.0
