public inbox for dri-devel@lists.freedesktop.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Hawking Zhang <Hawking.Zhang@amd.com>,
	Likun Gao <Likun.Gao@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Sasha Levin <sashal@kernel.org>,
	christian.koenig@amd.com, airlied@gmail.com, simona@ffwll.ch,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0-6.6] drm/amdgpu: fix shift-out-of-bounds when updating umc active mask
Date: Mon, 20 Apr 2026 09:18:35 -0400	[thread overview]
Message-ID: <20260420132314.1023554-121-sashal@kernel.org> (raw)
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Hawking Zhang <Hawking.Zhang@amd.com>

[ Upstream commit 1394a4926f4bcff0dc6ac6deea5beeb2844297f0 ]

UMC node_inst_num can exceed 32, causing
(1 << node_inst_num) to shift a 32-bit int
out of bounds

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed for a comprehensive analysis. Let
me compile the findings.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

**Step 1.1: Subject Line**
- Subsystem: `drm/amdgpu`
- Action verb: "fix"
- Summary: Fixes shift-out-of-bounds when computing UMC active mask
Record: [drm/amdgpu] [fix] [shift-out-of-bounds in UMC active mask
calculation]

**Step 1.2: Tags**
- Signed-off-by: Hawking Zhang (author), Alex Deucher (maintainer)
- Reviewed-by: Likun Gao
- No Fixes: tag, no Reported-by:, no Cc: stable (expected for a
  candidate commit)
Record: Patch reviewed by AMD colleague, signed off by amdgpu maintainer
Alex Deucher. No bug reporter or explicit stable nomination.

**Step 1.3: Commit Body**
The commit message is concise: `node_inst_num` can exceed 32, causing
`(1 << node_inst_num)` to shift a 32-bit int out of bounds. This is
undefined behavior in C.
Record: Bug = shift of a 32-bit `1` by >= 32 positions. Symptom =
undefined behavior, incorrect computation of `umc.active_mask`. Root
cause = using `int` literal `1` instead of `1ULL` for a shift that can
reach 32 or more.

**Step 1.4: Hidden Bug Fix Detection**
Not hidden at all - the subject says "fix" and the bug mechanism is
clearly stated.
Record: This is a straightforward bug fix for UB.

---

## PHASE 2: DIFF ANALYSIS

**Step 2.1: Inventory**
- 1 file changed: `drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c`
- 2 lines changed: `uint32_t` -> `u64` (variable type) and `1` -> `1ULL`
  (shift operand)
- Function modified: `amdgpu_discovery_read_from_harvest_table()`
Record: Single-file, 2-line surgical fix. Scope is minimal.

**Step 2.2: Code Flow Change**

Hunk 1 (line 777 equivalent):
- Before: `uint32_t umc_harvest_config = 0;` (32-bit variable)
- After: `u64 umc_harvest_config = 0;` (64-bit variable)
- Purpose: Allow storing harvest config bits for node instances >= 32

Hunk 2 (line 833):
- Before: `((1 << adev->umc.node_inst_num) - 1)` — `1` is `int` (32
  bits); shifting by >= 32 is UB
- After: `((1ULL << adev->umc.node_inst_num) - 1ULL)` — `1ULL` is
  `unsigned long long` (64 bits); safe for node_inst_num up to 63

Record: The fix widens both the intermediate shift result and the
accumulation variable to 64 bits, eliminating the UB.

**Step 2.3: Bug Mechanism**
This is category (f) **type/correctness fix** — specifically, a shift-
out-of-bounds / undefined behavior fix. In C, shifting an `int` by >=
its bit width (32) is undefined behavior per the standard. The result is
unpredictable and could yield an incorrect `active_mask`, which is used
to track which UMC (memory controller) instances are active.
Record: [Type/UB bug] [32-bit shift by >= 32 causes UB; fix uses 64-bit
types]

**Step 2.4: Fix Quality**
- Obviously correct: widening types to match the range of possible
  values is textbook UB fix
- Minimal/surgical: 2 lines
- Regression risk: extremely low — only changes type widths;
  `active_mask` is already `unsigned long` (64 bits on 64-bit systems)
Record: Fix is obviously correct, minimal, with near-zero regression
risk.

---

## PHASE 3: GIT HISTORY

**Step 3.1: Blame**
From git blame, the buggy code at lines 777 and 833 was introduced by
commit `2b595659d5aec7` (Candice Li, Feb 2023) — "drm/amdgpu: Support
umc node harvest config on umc v8_10". This commit was first included in
v6.4.
Record: Bug introduced in v6.4, present in all stable trees since
(6.6.y, 6.12.y, etc.).

**Step 3.2: Original Buggy Commit**
Verified via `git merge-base --is-ancestor`: commit 2b595659d5aec7 is
NOT in v6.1 or v6.3, but IS in v6.4 and v6.6.
Record: Bug exists in stable trees 6.4+, 6.6+. NOT in 6.1.y.

**Step 3.3: File History**
Recent changes to the file are mostly kmalloc refactoring (tree-wide
changes) and an IP block addition. No conflicting fixes for this
specific issue.
Record: Standalone fix, no prerequisites needed.

**Step 3.4: Author**
Hawking Zhang is a prolific AMD GPU contributor with 10+ recent commits
to the amdgpu subsystem, working on IP blocks, initialization, and RAS
features. He is an AMD engineer and a core contributor to this
subsystem.
Record: Author is a core amdgpu developer at AMD.

**Step 3.5: Dependencies**
The diff context shows `amdgpu_discovery_get_table_info()` and `struct
table_info *info`, which are NOT present in the 7.0 tree (which uses
`struct binary_header *bhdr` and direct access). The actual fix lines
(`uint32_t` -> `u64` and `1` -> `1ULL`) are present in both versions.
Record: Minor context differences for backport, but the fix itself is
trivially adaptable.

---

## PHASE 4: MAILING LIST RESEARCH

**Step 4.1-4.2:** b4 dig could not find the original buggy commit on
lore (AMD GPU patches often go through freedesktop.org/amd-gfx list
rather than lore). Web search found related shift-out-of-bounds fixes in
the amdgpu subsystem but not the exact commit being analyzed — it may be
very recent (2026).
Record: Could not find the exact patch thread. This is common for AMD
GPU patches which flow through the amd-gfx list.

**Step 4.3-4.5:** No bug reports or stable-specific discussions found
for this exact issue.
Record: No external bug reports found.

---

## PHASE 5: CODE SEMANTIC ANALYSIS

**Step 5.1: Key Functions**
Modified function: `amdgpu_discovery_read_from_harvest_table()`

**Step 5.2-5.3: Impact Surface**
`adev->umc.active_mask` is used by:
1. `LOOP_UMC_NODE_INST()` macro — iterates over active UMC nodes for RAS
   error counting
2. `amdgpu_umc_loop_all_aid()` — iterates over UMC instances for RAS
   queries
3. `amdgpu_psp.c` — passed to PSP firmware as `active_umc_mask`

An incorrect `active_mask` could cause:
- Missing or incorrect RAS error reporting
- Wrong UMC instances being queried for errors
- Incorrect firmware configuration
Record: active_mask affects RAS error handling and firmware
configuration.

**Step 5.4: Call Chain**
`amdgpu_discovery_read_from_harvest_table()` is called during GPU
initialization (probe path). This is a one-time setup function, but its
result persists for the lifetime of the driver.
Record: Called during init, result affects ongoing UMC/RAS operations.

---

## PHASE 6: STABLE TREE ANALYSIS

**Step 6.1:** The buggy code was introduced in v6.4 (commit
2b595659d5aec7). It exists in stable trees 6.6.y and later.
Record: Bug exists in 6.6.y, 6.12.y, and 7.0.y.

**Step 6.2:** The patch context differs slightly between the diff and
the 7.0 tree (helper function refactoring). The actual fix lines apply
conceptually with minor context adjustment.
Record: May need minor context adaptation for clean apply.

---

## PHASE 7: SUBSYSTEM CONTEXT

**Step 7.1:** drm/amdgpu is an IMPORTANT subsystem (widely used GPU
driver on AMD hardware).
**Step 7.2:** Very actively developed.
Record: [IMPORTANT] [Very active subsystem]

---

## PHASE 8: IMPACT AND RISK ASSESSMENT

**Step 8.1:** Affects users of AMD GPUs with >= 32 UMC node instances
(large server/datacenter GPUs like MI300 series, where `node_inst_num`
can reach 32+).
Record: Driver-specific, primarily affects large AMD datacenter GPUs.

**Step 8.2:** Triggers during GPU initialization when the hardware has
>= 32 UMC instances. Deterministic, not a race condition.
Record: Deterministic trigger on specific hardware configurations.

**Step 8.3:** The undefined behavior from the shift can produce an
incorrect `active_mask`, leading to wrong RAS error reporting and
potentially incorrect firmware configuration. While not a crash, UB can
cause any result including crashes on some compilers/architectures.
Record: Severity = MEDIUM-HIGH (UB, incorrect hardware config, potential
RAS malfunction).

**Step 8.4:**
- BENEFIT: Fixes real UB on production hardware (large AMD GPUs),
  ensures correct memory controller tracking
- RISK: 2-line type widening change, extremely low risk of regression
Record: High benefit, very low risk.

---

## PHASE 9: FINAL SYNTHESIS

**Evidence FOR backporting:**
- Fixes undefined behavior (shift-out-of-bounds) that is a clear
  violation of the C standard
- Affects real hardware (AMD GPUs with >= 32 UMC instances, e.g., MI300
  series)
- Minimal, 2-line fix that is obviously correct
- Reviewed by AMD engineer, signed off by amdgpu maintainer
- `active_mask` is used in RAS (reliability) error handling — getting
  this wrong affects hardware reliability monitoring
- Bug has existed since v6.4, present in all current stable trees except
  6.1.y
- Pattern matches other accepted stable fixes (shift-type fixes in
  amdgpu, e.g., `BIT()` -> `BIT_ULL()`)

**Evidence AGAINST backporting:**
- No Fixes: tag (expected)
- No explicit bug report or syzbot report
- Impact is limited to specific large GPU configurations
- Context differs slightly from stable trees (may need minor adaptation)

**Stable Rules Checklist:**
1. Obviously correct and tested? **YES** — type widening is trivially
   correct
2. Fixes a real bug? **YES** — undefined behavior per C standard
3. Important issue? **YES** — UB can cause incorrect hardware
   configuration
4. Small and contained? **YES** — 2 lines in 1 file
5. No new features/APIs? **YES** — pure fix
6. Can apply to stable? **YES** — with minor context adaptation

---

## Verification

- [Phase 1] Parsed subject: "drm/amdgpu: fix shift-out-of-bounds" —
  clear fix commit
- [Phase 2] Diff analysis: 2 lines changed — `uint32_t` -> `u64` and `1`
  -> `1ULL` in `amdgpu_discovery_read_from_harvest_table()`
- [Phase 3] git blame: buggy code introduced by commit 2b595659d5aec7
  (Candice Li, Feb 2023, v6.4)
- [Phase 3] git merge-base: confirmed commit 2b595659d5aec7 is in v6.4
  and v6.6, NOT in v6.1
- [Phase 3] git log --author: Hawking Zhang is a prolific AMD GPU
  contributor
- [Phase 4] b4 dig: could not find original submission on lore (AMD GPU
  patches go through freedesktop.org)
- [Phase 4] Web search: found related shift fixes in amdgpu but not
  exact patch thread
- [Phase 5] Grep: `active_mask` is `unsigned long` (64-bit), used by
  LOOP_UMC_NODE_INST macro, PSP firmware init, and RAS error queries
- [Phase 5] Grep: `node_inst_num` is `uint32_t`, incremented per
  UMC_HWID found; on gmc_v9_0, divided by 4 (can be 32+ on large GPUs)
- [Phase 6] Code exists in stable trees 6.6.y+; context differs slightly
  (bhdr vs table_info helper)
- [Phase 8] Failure mode: UB from shift, potentially incorrect
  active_mask affecting RAS operations
- UNVERIFIED: Exact patch discussion on amd-gfx mailing list (not found
  via search)
- UNVERIFIED: Whether UBSAN has actually fired on this in practice (no
  syzbot report)

**YES**

 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index af3d2fd61cf3f..32455b01bceb1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -774,7 +774,7 @@ static void amdgpu_discovery_read_from_harvest_table(struct amdgpu_device *adev,
 	struct harvest_table *harvest_info;
 	u16 offset;
 	int i;
-	uint32_t umc_harvest_config = 0;
+	u64 umc_harvest_config = 0;
 
 	bhdr = (struct binary_header *)discovery_bin;
 	offset = le16_to_cpu(bhdr->table_list[HARVEST_INFO].offset);
@@ -830,7 +830,7 @@ static void amdgpu_discovery_read_from_harvest_table(struct amdgpu_device *adev,
 		}
 	}
 
-	adev->umc.active_mask = ((1 << adev->umc.node_inst_num) - 1) &
+	adev->umc.active_mask = ((1ULL << adev->umc.node_inst_num) - 1ULL) &
 				~umc_harvest_config;
 }
 
-- 
2.53.0


  parent reply	other threads:[~2026-04-20 13:26 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20260420132314.1023554-1-sashal@kernel.org>
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.12] drm/amdgpu: fix DF NULL pointer issue for soc24 Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.18] drm/ttm: Avoid invoking the OOM killer when reading back swapped content Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 6.18] drm/vc4: Release runtime PM reference after binding V3D Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.19] drm/xe/vf: Wait for all fixups before using default LRCs Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: remove duplicate format modifier Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: unlock cancel_delayed_work_sync for hang_detect_work Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0-6.1] drm/amd/display: Merge pipes for validate Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/xe: Fix bug in idledly unit conversion Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0] drm/xe: Skip adding PRL entry to NULL VMA Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/vc4: Fix a memory leak in hang state error path Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/vc4: Protect madv read in vc4_gem_object_mmap() with madv_lock Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Fix cursor pos at overlay plane edges on DCN4 Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.1] drm/msm/dpu: fix vblank IRQ registration before atomic_mode_set Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 6.18] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-5.10] drm/amd/display: bios_parser: fix GPIO I2C line off-by-one Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Handle IH v7_1 reg offset differences Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu/vcn4.0.3: gate per-queue reset by PSP SOS program version Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/imx: parallel-display: add DRM_DISPLAY_HELPER for DRM_IMX_PARALLEL_DISPLAY Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix amdgpu_userq_evict Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-5.10] drm/amdgpu: validate fence_count in wait_fences ioctl Sasha Levin
2026-04-20 13:18 ` Sasha Levin [this message]
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: remove queue from doorbell xa during clean up Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] drm/amdkfd: fix kernel crash on releasing NULL sysfs entry Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/xe/guc: Add Wa_14025883347 for GuC DMA failure on reset Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: clear related counter after RAS eeprom reset Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Restore full update for tiling change to linear Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0] drm/amdgpu: fix array out of bounds accesses for mes sw_fini Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Exit IPS w/ DC helper for all dc_set_power_state cases Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: Check for multiplication overflow in checkpoint stack size Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/prime: Limit scatter list size with dedicated DMA device Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Clamp dc_cursor_position x_hotspot to prevent integer overflow Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: defer queue publication until create completes Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu/userq: fix dma_fence refcount underflow in userq path Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.12] drm/amdgpu: guard atom_context in devcoredump VBIOS dump Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.18] drm/amd/display: Avoid turning off the PHY when OTG is running for DVI Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Revert setting up Retry based Thrashing on GFX 12.1 Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amd/pm: Avoid overflow when sorting pp_feature list Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Fix number of opp Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/panel-edp: Change BOE NV140WUM-N64 timings Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amd/display: Fix HWSS v3 fast path determination Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-5.10] drm/mediatek: mtk_dsi: enable hs clock during pre-enable Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 6.18] drm/vc4: Fix memory leak of BO array in hang state Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Remove invalid DPSTREAMCLK mask usage Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/panel-edp: Add CMN N116BCL-EAK (C2) Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Add default reset method for soc_v1_0 Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: cleanup amdgpu_userq_get/put where not needed Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix some more bug in amdgpu_gem_va_ioctl Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-5.10] fbdev: omap2: fix inconsistent lock returns in omapfb_mmap Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm: gpu: msm: forbid mem reclaim from reset Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/panel-edp: Add AUO B116XAT04.1 (HW: 1A) Sasha Levin
2026-04-20 13:22 ` [PATCH AUTOSEL 7.0-6.6] drm/gem-dma: set VM_DONTDUMP for mmap Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260420132314.1023554-121-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=Hawking.Zhang@amd.com \
    --cc=Likun.Gao@amd.com \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=patches@lists.linux.dev \
    --cc=simona@ffwll.ch \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox