[PATCH AUTOSEL 7.0-6.18] drm/amdgpu: clear related counter after RAS eeprom reset

public inbox for dri-devel@lists.freedesktop.org
 help / color / mirror / Atom feed

From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Tao Zhou <tao.zhou1@amd.com>,
	Hawking Zhang <Hawking.Zhang@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Sasha Levin <sashal@kernel.org>,
	christian.koenig@amd.com, airlied@gmail.com, simona@ffwll.ch,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: clear related counter after RAS eeprom reset
Date: Mon, 20 Apr 2026 09:18:58 -0400	[thread overview]
Message-ID: <20260420132314.1023554-144-sashal@kernel.org> (raw)
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Tao Zhou <tao.zhou1@amd.com>

[ Upstream commit 3d77ca68eb0b48f88cc891d1b98f109b68e2ffcf ]

Make eeprom data and its counter consistent.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed. Let me compile the full analysis.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

### Step 1.1: Subject Line
- **Subsystem**: `drm/amdgpu` (AMD GPU driver, RAS subsystem)
- **Action verb**: "clear" (indicating missing cleanup / consistency
  fix)
- **Summary**: Clear the `count_saved` counter when RAS EEPROM table is
  reset

Record: [drm/amdgpu] [clear] [Ensure count_saved is zeroed when EEPROM
table is reset]

### Step 1.2: Tags
- **Signed-off-by**: Tao Zhou (author, regular AMD RAS contributor)
- **Reviewed-by**: Hawking Zhang (AMD subsystem lead for RAS)
- **Signed-off-by**: Alex Deucher (AMD GPU maintainer, final commit)
- No Fixes: tag, no Reported-by:, no Cc: stable

Record: Author is a regular AMD RAS contributor. Reviewed by AMD's RAS
lead.

### Step 1.3: Commit Body
- "Make eeprom data and its counter consistent"
- Terse description, but the intent is clear: a data consistency issue
  between EEPROM state and in-memory counters.

Record: Bug is a data consistency issue. After EEPROM reset,
`count_saved` retains a stale value while all other counters are zeroed.

### Step 1.4: Hidden Bug Fix Detection
This is a data consistency bug disguised as a minor cleanup. The word
"consistent" signals that the code was **inconsistent** before—i.e., the
counter was wrong after a reset. This is a real bug fix.

Record: Yes, this is a hidden bug fix. The "consistent" language masks
the fact that stale `count_saved` causes wrong data to be written to
EEPROM on subsequent saves.

---

## PHASE 2: DIFF ANALYSIS

### Step 2.1: Inventory
- **Files**: 1 file modified (`amdgpu_ras_eeprom.c`)
- **Lines**: +3 (one comment, one NULL check, one assignment)
- **Function modified**: `amdgpu_ras_eeprom_reset_table()`
- **Scope**: Single-file surgical fix

### Step 2.2: Code Flow Change
- **Before**: `amdgpu_ras_eeprom_reset_table()` zeroed `ras_num_recs`,
  `ras_num_bad_pages`, `ras_num_mca_recs`, `ras_num_pa_recs`, `ras_fri`,
  `bad_channel_bitmap`, and `update_channel_flag`, but left
  `eh_data->count_saved` unchanged.
- **After**: Also zeroes `con->eh_data->count_saved` (with NULL guard on
  `eh_data`).

### Step 2.3: Bug Mechanism
This is a **data consistency / correctness bug**. `count_saved` is used
as an array index in `amdgpu_ras_save_bad_pages()`:

```3341:3341:drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
                                        &data->bps[data->count_saved],
unit_num)) {
```

If EEPROM is reset but `count_saved` retains value N from before, the
next save operation starts writing from `bps[N]` instead of `bps[0]`.
This means:
1. **Wrong data is written to EEPROM** (skipping the first N entries)
2. **Potential out-of-bounds access** if bps array was reorganized

There are direct call sequences that trigger this: `reset_table` ->
`save_bad_pages` at lines 1783-1784 and 3837-3838.

### Step 2.4: Fix Quality
- Obviously correct: when all EEPROM records are cleared, the "saved
  count" must be 0
- Minimal: 3 lines, single variable assignment with NULL guard
- No regression risk: the NULL check prevents any potential NULL deref

---

## PHASE 3: GIT HISTORY INVESTIGATION

### Step 3.1: Blame
The reset function's body was built incrementally since v5.3 (2019) by
Andrey Grodzovsky, with additions by Luben Tuikov (2021), Stanley Yang
(2022), and Tao Zhou (2024). The `count_saved` field was introduced in
commit d45c5e6845a76 by Tao Zhou (2025-07-04), first appearing in v6.18.

### Step 3.2: No Fixes: tag
No Fixes: tag present. The logical "fixes" target would be d45c5e6845a76
which introduced `count_saved` without clearing it in the reset path.

### Step 3.3: File History
The file is actively developed with 20+ recent commits. Patch 1/2 of
this series ("compatible with specific RAS old eeprom format") modifies
`amdgpu_ras.c` and is thematically related but functionally independent.

### Step 3.4: Author
Tao Zhou is a frequent AMD RAS contributor (10+ recent commits to RAS
code) and the same author who introduced `count_saved`.

### Step 3.5: Dependencies
- **Requires** d45c5e6845a76 (introduces `count_saved` field) - present
  only in v6.18+
- **Does NOT depend on** patch 1/2 of the series (separate bug fix)
- Standalone fix

---

## PHASE 4: MAILING LIST RESEARCH

- Found at `lists.freedesktop.org/archives/amd-
  gfx/2026-February/139281.html`
- Part of a 2-patch series; b4 dig did not find a match (AMD internal
  submission path)
- Reviewed-by from Hawking Zhang (AMD RAS lead) for the entire series
- No NAKs or concerns raised
- No explicit stable nomination by reviewers

---

## PHASE 5: CODE SEMANTIC ANALYSIS

### Step 5.1-5.4: Function/Caller Analysis
`amdgpu_ras_eeprom_reset_table()` is called from:
1. **`amdgpu_ras_debugfs_eeprom_write()`** - user-triggered via debugfs
   (privileged only)
2. **`amdgpu_ras_eeprom_init()`** - during driver initialization (new
   table creation)
3. **`amdgpu_ras_eeprom_check_and_recover()`** - reset + immediate
   save_bad_pages
4. **`amdgpu_ras_init_badpage_info()`** - reset + immediate
   save_bad_pages (format upgrade path)

Call sites 3 and 4 are the dangerous ones: they call `reset_table`
immediately followed by `save_bad_pages`, which will use the stale
`count_saved` as an array index.

---

## PHASE 6: STABLE TREE ANALYSIS

### Step 6.1: Which stable trees have the buggy code?
`count_saved` was introduced in d45c5e6845a76, first in v6.18. This fix
is only relevant for **v6.18.y and newer** stable trees.

### Step 6.2: Backport Difficulty
The patch is 3 lines, no surrounding context changes. Clean apply
expected on any tree containing d45c5e6845a76.

---

## PHASE 7: SUBSYSTEM CONTEXT

- **Subsystem**: drm/amdgpu RAS (Reliability, Availability,
  Serviceability)
- **Criticality**: IMPORTANT - RAS tracks and retires bad GPU memory
  pages. Incorrect tracking means potentially using defective memory, or
  incorrectly retiring good memory.
- Active subsystem with frequent development.

---

## PHASE 8: IMPACT AND RISK ASSESSMENT

### Step 8.1: Affected Users
Users with AMD GPUs that support RAS (datacenter GPUs like MI200, MI300
series). This is a significant enterprise/HPC user population.

### Step 8.2: Trigger Conditions
- Triggered when EEPROM table is reset and new bad pages are
  subsequently saved
- Can occur during: checksum recovery (automatic), table format upgrade
  (automatic), debugfs user action

### Step 8.3: Failure Mode
- **Data corruption**: Wrong records written to EEPROM (persists across
  reboots)
- **Potential OOB access**: If `count_saved` points beyond current `bps`
  array bounds
- Severity: **MEDIUM-HIGH** - corrupted RAS data means bad memory pages
  might not be properly tracked

### Step 8.4: Risk-Benefit
- **Benefit**: Prevents EEPROM data corruption and potential OOB access.
  Medium-high benefit for datacenter GPU users.
- **Risk**: Very low. 3 lines, single variable zeroed with NULL guard.
  Impossible to introduce a regression.

---

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: Evidence

**FOR backporting:**
- Fixes a real data consistency bug: stale `count_saved` causes wrong
  data written to EEPROM
- Potential OOB array access from stale index
- 3-line fix, surgical, obviously correct
- Reviewed by AMD RAS subsystem lead (Hawking Zhang)
- Author is the same person who introduced `count_saved` (understands
  the code)
- Direct trigger paths exist (reset_table -> save_bad_pages sequences)
- Affects enterprise/datacenter GPU users (RAS is critical for data
  integrity)

**AGAINST backporting:**
- Only applicable to v6.18+ trees (where `count_saved` exists)
- No user-reported crash or bug report - preventive fix by the developer
- Limited user population (AMD datacenter GPUs with RAS)

### Step 9.2: Stable Rules Checklist
1. Obviously correct and tested? **YES** - zeroing a counter when table
   is reset is trivially correct
2. Fixes a real bug? **YES** - stale index causes wrong EEPROM writes
   and potential OOB access
3. Important issue? **YES** - data corruption of persistent RAS bad page
   table
4. Small and contained? **YES** - 3 lines in 1 file
5. No new features? **YES** - pure fix
6. Can apply to stable? **YES** - clean apply on v6.18+

### Step 9.3: Exception Categories
Not applicable (standard bug fix, not a quirk/ID/DT addition).

---

## Verification

- [Phase 1] Parsed tags: Reviewed-by Hawking Zhang, Signed-off-by Alex
  Deucher, Tao Zhou as author
- [Phase 2] Diff analysis: 3 lines added in
  `amdgpu_ras_eeprom_reset_table()`, clearing `count_saved` with NULL
  guard
- [Phase 2] Traced `count_saved` usage: used as array index in
  `amdgpu_ras_save_bad_pages()` at lines 3341, 3348
- [Phase 3] git blame: surrounding code added by various AMD developers
  2019-2024; `count_saved` introduced in d45c5e6845a76 (v6.18)
- [Phase 3] git merge-base --is-ancestor: confirmed d45c5e6845a76 NOT in
  v6.17, IS in v6.18
- [Phase 3] Author check: Tao Zhou has 10+ recent RAS commits,
  introduced `count_saved` himself
- [Phase 4] Found patch discussion at lists.freedesktop.org, Hawking
  Zhang reviewed both patches
- [Phase 4] Series is 2/2, patch 1/2 is independent (different file,
  different bug)
- [Phase 5] Traced callers of `amdgpu_ras_eeprom_reset_table`: found 4
  call sites, 2 immediately follow with `save_bad_pages` (lines
  1783-1784, 3837-3838) - confirming the bug is triggerable
- [Phase 6] Fix only applies to v6.18+ stable trees
- [Phase 8] Failure mode: wrong records written to EEPROM (data
  corruption), potential OOB array access

The fix is small, surgical, obviously correct, and addresses a real data
consistency bug that can cause EEPROM corruption and potential out-of-
bounds access. It meets all stable kernel criteria for applicable trees
(v6.18+).

**YES**

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 6fba9d5b29ea6..44fba4b6aa92a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -508,6 +508,9 @@ int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control)
 	control->bad_channel_bitmap = 0;
 	amdgpu_dpm_send_hbm_bad_channel_flag(adev, control->bad_channel_bitmap);
 	con->update_channel_flag = false;
+	/* there is no record on eeprom now, clear the counter */
+	if (con->eh_data)
+		con->eh_data->count_saved = 0;

 	amdgpu_ras_debugfs_set_ret_size(control);

-- 
2.53.0

next prev parent reply	other threads:[~2026-04-20 13:27 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20260420132314.1023554-1-sashal@kernel.org>
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.12] drm/amdgpu: fix DF NULL pointer issue for soc24 Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.18] drm/ttm: Avoid invoking the OOM killer when reading back swapped content Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 6.18] drm/vc4: Release runtime PM reference after binding V3D Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.19] drm/xe/vf: Wait for all fixups before using default LRCs Sasha Levin
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: remove duplicate format modifier Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: unlock cancel_delayed_work_sync for hang_detect_work Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0-6.1] drm/amd/display: Merge pipes for validate Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/xe: Fix bug in idledly unit conversion Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0] drm/xe: Skip adding PRL entry to NULL VMA Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/vc4: Fix a memory leak in hang state error path Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 6.18] drm/vc4: Protect madv read in vc4_gem_object_mmap() with madv_lock Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Fix cursor pos at overlay plane edges on DCN4 Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.1] drm/msm/dpu: fix vblank IRQ registration before atomic_mode_set Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 6.18] drm/amdgpu: Handle GPU page faults correctly on non-4K page systems Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-5.10] drm/amd/display: bios_parser: fix GPIO I2C line off-by-one Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Handle IH v7_1 reg offset differences Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu/vcn4.0.3: gate per-queue reset by PSP SOS program version Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/imx: parallel-display: add DRM_DISPLAY_HELPER for DRM_IMX_PARALLEL_DISPLAY Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix amdgpu_userq_evict Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-5.10] drm/amdgpu: validate fence_count in wait_fences ioctl Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.6] drm/amdgpu: fix shift-out-of-bounds when updating umc active mask Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: remove queue from doorbell xa during clean up Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] drm/amdkfd: fix kernel crash on releasing NULL sysfs entry Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.18] drm/xe/guc: Add Wa_14025883347 for GuC DMA failure on reset Sasha Levin
2026-04-20 13:18 ` Sasha Levin [this message]
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Restore full update for tiling change to linear Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0] drm/amdgpu: fix array out of bounds accesses for mes sw_fini Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Exit IPS w/ DC helper for all dc_set_power_state cases Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: Check for multiplication overflow in checkpoint stack size Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] drm/prime: Limit scatter list size with dedicated DMA device Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Clamp dc_cursor_position x_hotspot to prevent integer overflow Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: defer queue publication until create completes Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu/userq: fix dma_fence refcount underflow in userq path Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.12] drm/amdgpu: guard atom_context in devcoredump VBIOS dump Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.18] drm/amd/display: Avoid turning off the PHY when OTG is running for DVI Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Revert setting up Retry based Thrashing on GFX 12.1 Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] drm/amd/pm: Avoid overflow when sorting pp_feature list Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/amd/display: Fix number of opp Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.19] drm/panel-edp: Change BOE NV140WUM-N64 timings Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amd/display: Fix HWSS v3 fast path determination Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-5.10] drm/mediatek: mtk_dsi: enable hs clock during pre-enable Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 6.18] drm/vc4: Fix memory leak of BO array in hang state Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.12] drm/amd/display: Remove invalid DPSTREAMCLK mask usage Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/panel-edp: Add CMN N116BCL-EAK (C2) Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amdgpu: Add default reset method for soc_v1_0 Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0] drm/amdgpu/userq: cleanup amdgpu_userq_get/put where not needed Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/amdgpu: fix some more bug in amdgpu_gem_va_ioctl Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-5.10] fbdev: omap2: fix inconsistent lock returns in omapfb_mmap Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm: gpu: msm: forbid mem reclaim from reset Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] drm/panel-edp: Add AUO B116XAT04.1 (HW: 1A) Sasha Levin
2026-04-20 13:22 ` [PATCH AUTOSEL 7.0-6.6] drm/gem-dma: set VM_DONTDUMP for mmap Sasha Levin

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:6fba9d5b29ea dfblob:44fba4b6aa92 )
 OR (
bs:"[PATCH AUTOSEL 7.0-6.18] drm/amdgpu: clear related counter after RAS eeprom reset" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260420132314.1023554-144-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=Hawking.Zhang@amd.com \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=patches@lists.linux.dev \
    --cc=simona@ffwll.ch \
    --cc=stable@vger.kernel.org \
    --cc=tao.zhou1@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox