[PATCH AUTOSEL 7.0] drm/amdkfd: check if vm ready in svm map and unmap to gpu

From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: YuanShang <YuanShang.Mao@amd.com>,
	Philip Yang <philip.yang@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Sasha Levin <sashal@kernel.org>,
	Felix.Kuehling@amd.com, christian.koenig@amd.com,
	airlied@gmail.com, simona@ffwll.ch,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0] drm/amdkfd: check if vm ready in svm map and unmap to gpu
Date: Tue,  5 May 2026 05:51:45 -0400
Message-ID: <20260505095149.512052-29-sashal@kernel.org>
In-Reply-To: <20260505095149.512052-1-sashal@kernel.org>

From: YuanShang <YuanShang.Mao@amd.com>

[ Upstream commit d0f5711fa14a09c010537375cf34893cd33bc2ee ]

Don't map or unmap svm range to gpu if vm is not ready for updates.

Why: DRM entity may already be killed when the svm worker try to
update gpu vm.

Signed-off-by: YuanShang <YuanShang.Mao@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 55f8e366c326980174a4f2b9501b524d8eb25135)
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Phase 1: Commit Message Forensics
Step 1.1 Record: Subsystem `drm/amdkfd`, action verb `check`, intent:
avoid SVM GPU VM map/unmap when the AMDGPU VM cannot accept updates.

Step 1.2 Record: Tags found in the actual commit: `Signed-off-by:
YuanShang <YuanShang.Mao@amd.com>`, `Reviewed-by: Philip Yang
<philip.yang@amd.com>`, `Signed-off-by: Alex Deucher
<alexander.deucher@amd.com>`. No `Fixes:`, no `Reported-by:`, no
`Tested-by:`, no `Cc: stable`.

Step 1.3 Record: The commit says the SVM worker may try to update a GPU
VM after the DRM scheduler entity has already been killed. The user-
visible symptom was verified from the lore thread: “Trying to push to a
killed entity”, SDMA timeout, GPU reset, and a hung
`svm_range_restore_work` kworker blocked in `dma_fence_wait_timeout()`
via `svm_range_validate_and_map()`.

Step 1.4 Record: This is a hidden bug fix despite the neutral “check”
wording. It prevents submitting VM update jobs to a stopped/killed VM
update entity, which can otherwise leave fences unsignaled and hang the
worker context.

## Phase 2: Diff Analysis
Step 2.1 Record: One file changed:
`drivers/gpu/drm/amd/amdkfd/kfd_svm.c`, 11 insertions. Modified
functions: `svm_range_unmap_from_gpu()` and `svm_range_map_to_gpu()`.
Scope: single-file surgical fix.

Step 2.2 Record: Before, both SVM unmap and map directly called
`amdgpu_vm_update_range()`. After, both first call `amdgpu_vm_ready(vm)`
and return `-EINVAL` if the VM is not ready. Affected path is VM page
table update submission from SVM map/unmap, including restore worker and
MMU notifier/unmap paths.

Step 2.3 Record: Bug category is synchronization/lifetime correctness
around process teardown. `amdgpu_vm_ready()` in current mainline
verifies the VM is not evicting, has no evicted PTs, and its
immediate/delayed VM update scheduler entities are not stopped. The fix
avoids queueing jobs after those entities are killed.
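
The guard described in Steps 2.2 and 2.3 can be sketched as a small
user-space model. Everything named `model_*` below is hypothetical and
exists only for illustration; it is not the kernel's `struct amdgpu_vm`
or the real `amdgpu_vm_ready()`, and the job counter merely stands in
for `amdgpu_vm_update_range()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the VM state named in Step 2.3. */
struct model_vm {
	bool evicting;           /* VM is currently being evicted */
	bool has_evicted_pts;    /* some page tables are evicted */
	bool immediate_stopped;  /* immediate update entity stopped */
	bool delayed_stopped;    /* delayed update entity stopped */
};

/* Ready only if none of the blocking conditions holds. */
bool model_vm_ready(const struct model_vm *vm)
{
	return !vm->evicting && !vm->has_evicted_pts &&
	       !vm->immediate_stopped && !vm->delayed_stopped;
}

/* Guarded map: refuse before any update job would be submitted. */
int model_map_to_gpu(const struct model_vm *vm, int *jobs_submitted)
{
	if (!model_vm_ready(vm))
		return -EINVAL;

	(*jobs_submitted)++;  /* stands in for amdgpu_vm_update_range() */
	return 0;
}

/* Small self-check showing both outcomes of the guard. */
void model_demo(void)
{
	struct model_vm vm = { false, false, false, false };
	int jobs = 0;

	assert(model_map_to_gpu(&vm, &jobs) == 0);
	assert(jobs == 1);

	vm.delayed_stopped = true;  /* entity killed during teardown */
	assert(model_map_to_gpu(&vm, &jobs) == -EINVAL);
	assert(jobs == 1);          /* nothing was queued */
}
```

The point of the model is ordering: the readiness test runs before any
job is created, so a killed entity never receives a push.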

Step 2.4 Record: Fix quality is good: 11 lines, no new API, no feature,
no data structure changes. Regression risk is low and comes mainly from
the early `-EINVAL` return when VM updates cannot run anyway. Backport
risk is higher for older trees because `amdgpu_vm_ready()` only gained
stopped-entity checks in commit `f101c13a8720c7`; older stable trees
need that or an equivalent prerequisite for this patch to address the
killed-entity failure.

## Phase 3: Git History Investigation
Step 3.1 Record: Blame shows SVM map/unmap infrastructure was introduced
by `f80fe9d3c114` (“drm/amdkfd: map svm range to GPUs”, first in
`v5.14-rc1`) and later reshaped by commits including `6c1a7867734`
(`v5.18-rc1`). The missing readiness guard has existed in these SVM
paths for a long time.

Step 3.2 Record: No `Fixes:` tag, so no direct target to follow.

Step 3.3 Record: Recent file history contains many SVM fixes, including
UAF, address conversion, PTE clearing, restore work, and retry-fault
race fixes. Related commit `597eb70f7ff7` / upstream `10c382ec6c6d`
(“drm/amdkfd: Don’t clear PT after process killed”) added an
`amdgpu_vm_ready()` guard in a different KFD GPUVM path and was
explicitly stable-tagged.

Step 3.4 Record:
`git log --author='YuanShang' -10 -- drivers/gpu/drm/amd/amdkfd`
produced no reachable prior commits in this checkout. The patch was
reviewed by Philip Yang, a regular AMD KFD contributor, and committed by
Alex Deucher.

Step 3.5 Record: Dependency identified: `f101c13a8720c7` (“drm/amdgpu:
fix task hang from failed job submission during process kill”) teaches
`amdgpu_vm_ready()` to check stopped VM update entities. Without it,
this candidate’s guard does not fully detect the killed-entity condition
in older stable trees.
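
Why the dependency matters can be shown with a minimal user-space
sketch that assumes only what Steps 2.3 and 3.5 state. The `sketch_*`
names are hypothetical: one predicate models readiness without the
stopped-entity check, the other models readiness with it. Neither is
the real kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical VM state; a single flag covers both update entities. */
struct sketch_vm {
	bool evicting;
	bool has_evicted_pts;
	bool entity_stopped;
};

/* Assumed shape of readiness WITHOUT the stopped-entity check. */
bool sketch_ready_without_entity_check(const struct sketch_vm *vm)
{
	return !vm->evicting && !vm->has_evicted_pts;
}

/* Assumed shape of readiness WITH the stopped-entity check. */
bool sketch_ready_with_entity_check(const struct sketch_vm *vm)
{
	return sketch_ready_without_entity_check(vm) && !vm->entity_stopped;
}

/* Process killed: entity stopped, but nothing is evicted. */
void sketch_demo(void)
{
	struct sketch_vm vm = { false, false, true };

	/* Older predicate still reports ready, so the new guard passes. */
	assert(sketch_ready_without_entity_check(&vm));
	/* Newer predicate reports not ready, so the guard rejects. */
	assert(!sketch_ready_with_entity_check(&vm));
}
```

Under these assumptions, the guards added by this patch would be a
no-op for the killed-entity case on a tree whose readiness check ignores
stopped entities.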

## Phase 4: Mailing List And External Research
Step 4.1 Record: `b4 dig -c 55f8e366c326...` found the original
submission at
`https://patch.msgid.link/20260326103656.487304-1-YuanShang.Mao@amd.com`.
`b4 dig -a` found only v1, standalone. WebFetch to lore was blocked by
Anubis, but `b4 dig -m` retrieved the mbox successfully.

Step 4.2 Record: `b4 dig -w` showed original recipients were YuanShang
and `amd-gfx@lists.freedesktop.org`. The thread later included Christian
König and Philip Yang.

Step 4.3 Record: No separate bugzilla/syzbot link. The thread itself
contains the bug log: killed entity error, SDMA timeout, GPU reset,
recovery of the wedged device, and a hung kworker in
`svm_range_restore_work`.

Step 4.4 Record: Philip Yang stated the earlier “Don’t clear PT after
process killed” patch fixed one path and this patch fixes another path,
then gave `Reviewed-by: Philip Yang <philip.yang@amd.com>`. No NAKs
found.

Step 4.5 Record: A stable-specific web search could not be completed
because WebFetch requests to lore/stable timed out or hit Anubis. The
mbox contains no stable nomination for this exact patch.

## Phase 5: Code Semantic Analysis
Step 5.1 Record: Key functions: `svm_range_unmap_from_gpu()`,
`svm_range_map_to_gpu()`.

Step 5.2 Record: Callers verified: `svm_range_unmap_from_gpu()` is
called by `svm_range_unmap_from_gpus()`, reached from CPU unmap/MMU
notifier handling and SVM validation with PROT_NONE.
`svm_range_map_to_gpu()` is called by `svm_range_map_to_gpus()`, reached
from `svm_range_validate_and_map()`.

Step 5.3 Record: Key callees: both changed functions call
`amdgpu_vm_update_range()`. For SDMA VM updates, that path
allocates/submits an AMDGPU job; `amdgpu_job_submit()` arms the
scheduler job and calls `drm_sched_entity_push_job()`.

Step 5.4 Record: Reachability verified: `svm_range_restore_work()` calls
`svm_range_validate_and_map()`, which calls `svm_range_map_to_gpus()`
and then `svm_range_map_to_gpu()`. The lore log shows exactly this call
chain in a hung kworker. GPU page fault and MMU notifier paths also
reach the same validation/unmap functions.

Step 5.5 Record: Similar pattern verified: `amdgpu_amdkfd_gpuvm.c`
already has an `amdgpu_vm_ready()` guard with the comment “VM entity
stopped if process killed”; `amdgpu_cs.c` and `amdgpu_gem.c` also check
VM readiness before clearing freed mappings.

## Phase 6: Stable Tree Analysis
Step 6.1 Record: The SVM map/unmap functions exist in `v5.15`, `v6.1`,
`v6.6`, and `v6.8`, and none of those extracted versions had the new
guards. The reported log was from Ubuntu `6.8.0-90-generic`, confirming
a stable-derived affected kernel.

Step 6.2 Record: Backport difficulty: minor to moderate. `v6.8`, `v6.6`,
and `v6.1` have the same conceptual functions but older
`amdgpu_vm_update_range()` signatures. `v5.15` uses older
`amdgpu_vm_bo_update_mapping()` in this path. Older trees also need
`f101c13a8720c7` or equivalent stopped-entity readiness logic.

Step 6.3 Record: Related fix `597eb70f7ff7`/`10c382ec6c6d` addresses a
different process-kill VM update path and was stable-tagged. It does not
cover SVM map/unmap; Philip Yang explicitly confirmed this patch fixes
another path.

## Phase 7: Subsystem Context
Step 7.1 Record: Subsystem is AMDGPU KFD SVM/HMM GPU memory management.
Criticality: important, affecting AMD compute users using KFD SVM, GPU
page faults, migration, and process teardown.

Step 7.2 Record: Subsystem is active; recent history shows many SVM
correctness fixes. The bug is in a mature path present since `v5.14+`,
not just brand-new code.

## Phase 8: Impact And Risk
Step 8.1 Record: Affected population is driver/config/hardware specific:
AMDGPU KFD users with SVM-capable compute workloads.

Step 8.2 Record: Trigger requires SVM VM update work racing with forced
process kill or teardown after VM scheduler entities are stopped. The
lore log verifies a real trigger. Whether it is fully unprivileged
depends on render/KFD device permissions and was not independently
verified.

Step 8.3 Record: Failure mode is severe: verified killed-entity error,
SDMA ring timeout, GPU reset, recovered device wedge, and hung kworker
for more than 245 seconds. Severity: HIGH, arguably CRITICAL for
affected systems.

Step 8.4 Record: Benefit is high for affected AMD KFD users because it
avoids a real hung-task/GPU-reset failure. Risk is low in mainline-
shaped code because the fix only refuses impossible VM updates. Risk for
older stable trees is manageable but requires prerequisite/backport care
around `amdgpu_vm_ready()` semantics.

## Phase 9: Final Synthesis
Step 9.1 Record: Evidence for backporting: real user log, severe
hang/GPU reset, small surgical fix, reviewed by Philip Yang, related
already-stable process-kill fix shows same class of bug, affected code
exists in stable-derived kernels. Evidence against: no `Fixes:` or `Cc:
stable`, and older trees need dependency/backport adjustment.
Unresolved: exact clean-apply status for every active stable branch was
not tested.

Step 9.2 Record: Stable rules checklist:
- Obviously correct and reviewed: yes, with dependency caveat.
- Fixes real bug: yes, verified by lore log.
- Important issue: yes, hung task/GPU reset.
- Small and contained: yes, 11 lines in one file.
- No new feature/API: yes.
- Applies to stable: likely needs minor backport adjustment and
  `f101c13a8720c7` or equivalent for older trees.

Step 9.3 Record: No exception category applies; this is not a device ID,
quirk, DT, build, or documentation-only patch.

Step 9.4 Record: Decision is to backport, but not as an isolated blind
cherry-pick to older stable trees. It should be taken with the stopped-
entity readiness prerequisite/backport so `amdgpu_vm_ready()` actually
detects the process-kill condition.

## Verification
- Phase 1: Parsed target commit with `git show`; confirmed tags and no
  `Fixes:`/stable tag.
- Phase 2: Diff verified as 11 insertions in
  `svm_range_unmap_from_gpu()` and `svm_range_map_to_gpu()`.
- Phase 3: `git blame` verified SVM map/unmap history; `git describe
  --contains` placed the original SVM mapping in `v5.14-rc1` and later
  function reshaping in `v5.18-rc1`.
- Phase 3: `git show f101c13a8720c7` verified `amdgpu_vm_ready()` gained
  stopped-entity checks.
- Phase 3: `git show 597eb70f7ff7` verified the related stable-tagged
  “Don’t clear PT after process killed” guard in another path.
- Phase 4: `b4 dig`, `b4 dig -a`, `b4 dig -w`, and decoded mbox verified
  the lore thread, single v1 patch, recipients, log, Christian’s “Looks
  correct”, and Philip’s `Reviewed-by`.
- Phase 5: `rg` and file reads traced map/unmap callers through
  `svm_range_validate_and_map()` and `svm_range_restore_work()`.
- Phase 6: Extracted `v5.15`, `v6.1`, `v6.6`, and `v6.8` files; verified
  the affected SVM functions exist without the new guard.
- Phase 8: Lore log verified severity: killed entity, SDMA timeout, GPU
  reset, device wedge recovery, and hung kworker.
- UNVERIFIED: WebFetch lore/stable searches were blocked/timed out, and
  I did not test applying the patch to each stable branch.

**YES**

 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 080242f9981b0..addb86803d9ae 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1363,6 +1363,12 @@ svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,

 	pr_debug("CPU[0x%llx 0x%llx] -> GPU[0x%llx 0x%llx]\n", start, last,
 		gpu_start, gpu_end);
+
+	if (!amdgpu_vm_ready(vm)) {
+		pr_debug("VM not ready, canceling unmap\n");
+		return -EINVAL;
+	}
+
 	return amdgpu_vm_update_range(adev, vm, false, true, true, false, NULL, gpu_start,
 				      gpu_end, init_pte_value, 0, 0, NULL, NULL,
 				      fence);
@@ -1440,6 +1446,11 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
 	pr_debug("svms 0x%p [0x%lx 0x%lx] readonly %d\n", prange->svms,
 		 last_start, last_start + npages - 1, readonly);

+	if (!amdgpu_vm_ready(vm)) {
+		pr_debug("VM not ready, canceling map\n");
+		return -EINVAL;
+	}
+
 	for (i = offset; i < offset + npages; i++) {
 		uint64_t gpu_start;
 		uint64_t gpu_end;
-- 
2.53.0