From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Kent Russell <kent.russell@amd.com>,
Felix Kuehling <felix.kuehling@amd.com>,
Alex Deucher <alexander.deucher@amd.com>,
Sasha Levin <sashal@kernel.org>,
Felix.Kuehling@amd.com, amd-gfx@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17-6.6] drm/amdkfd: Handle lack of READ permissions in SVM mapping
Date: Sat, 25 Oct 2025 11:58:20 -0400 [thread overview]
Message-ID: <20251025160905.3857885-269-sashal@kernel.org> (raw)
In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org>
From: Kent Russell <kent.russell@amd.com>
[ Upstream commit 0ed704d058cec7643a716a21888d58c7d03f2c3e ]
HMM assumes that pages have READ permissions by default. Inside
svm_range_validate_and_map, we add READ permissions then add WRITE
permissions if the VMA isn't read-only. This will conflict with regions
that only have PROT_WRITE or have PROT_NONE. When that happens,
svm_range_restore_work will continue to retry, silently, giving the
impression of a hang if pr_debug isn't enabled to show the retries..
If pages don't have READ permissions, simply unmap them and continue. If
they weren't mapped in the first place, this would be a no-op. Since x86
doesn't support write-only, and PROT_NONE doesn't allow reads or writes
anyways, this will allow the svm range validation to continue without
getting stuck in a loop forever on mappings we can't use with HMM.
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
Explanation
- Bug fixed: The change addresses an indefinite retry loop (apparent
hang) in HMM-backed SVM mapping when encountering VMAs without read
permission, specifically write-only mappings and PROT_NONE. The loop
is triggered because HMM assumes READ by default and the existing code
adds READ then WRITE in svm_range_validate_and_map. That conflicts
with mappings that lack READ and causes svm_range_restore_work to
silently retry forever.
- Core change: In drivers/gpu/drm/amd/amdkfd/kfd_svm.c inside
svm_range_validate_and_map, after resolving the VMA and before calling
amdgpu_hmm_range_get_pages, the patch adds a guard:
- Check: if (!(vma->vm_flags & VM_READ)) { … continue; }
- Behavior on no-READ: Acquire range lock, optionally pr_debug if
VM_WRITE is set without VM_READ, compute the intersection of the
current address range with prange, call
svm_range_unmap_from_gpus(prange, s, e,
KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU), unlock, advance addr to next,
and continue.
- This explicitly treats PROT_NONE and write-only VMAs as unmappable
for HMM/SVM and avoids mapping attempts that will never succeed.
- Containment: The change is localized to a single function and code
path used during SVM range validation/mapping. No APIs or data
structures are changed. It only affects the slow path when
encountering a VMA without VM_READ; normal mappings (with READ) follow
the existing flow unchanged.
- User impact: Prevents a hang-like condition (endless retry) that users
would experience during SVM range restoration/validation when a VMA
has PROT_NONE or write-only protection. This is a real, user-facing
bug that can stall workloads using KFD/HMM.
- Risk assessment:
- Minimal risk: The unmap is a no-op if the pages were not mapped (“If
they weren't mapped in the first place, this would be a no-op”), and
otherwise it correctly tears down mappings that cannot be used by
HMM anyway (x86 doesn’t support write-only, and PROT_NONE allows
neither reads nor writes).
- No architectural changes: No redesign or wide-reaching behavior
changes outside of this corner case. Logging is via pr_debug, so no
noisy kernel logs in production.
- Stable criteria:
- Important bugfix affecting users: Yes (prevents indefinite
retry/hang).
- Small and contained: Yes (~20 lines, one function).
- No new features: Correct.
- No broad side effects: Correct; behavior is limited to VMAs lacking
READ, which cannot be supported by HMM.
- Explicit stable tag: Not shown in the snippet, but the fix clearly
meets stable backport guidelines due to the hang avoidance and
limited scope.
Notes for backporters
- Interface compatibility: The patch relies on existing primitives
present in amdkfd SVM code paths: vma_lookup, svm_range_lock/unlock,
svm_range_unmap_from_gpus, KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU, and
prange fields (start/last). These are stable in recent kernels that
have SVM/HMM in KFD.
- Unit consistency check: Ensure the parameters passed to
svm_range_unmap_from_gpus(prange, s, e, …) are in the units expected
by your target stable branch. In the shown diff, s/e are computed as s
= max(start, prange->start) and e = min(end, prange->last) where
start/end appear to be byte addresses (start = map_start <<
PAGE_SHIFT) and prange->start/last are often page indices in KFD SVM
code. Upstream code likely uses consistent units (either all pages or
all bytes). When backporting, verify that s/e match the function’s
expected units (adjust by PAGE_SHIFT if necessary) to avoid off-by-
PAGE_SHIFT mistakes.
- Validation suggestion: Reproduce with a user VMA set to PROT_NONE or
write-only protection and trigger SVM range validation (e.g., by
causing GPU access). Before the fix, svm_range_restore_work would
continuously retry; after the fix, the range is unmapped and
validation proceeds without looping.
Why this is safe and needed
- The patch turns an unrecoverable, retry-forever condition into a
deterministic handling path by unmapping and moving on. It does not
try to force unsupported permissions and does not alter behavior for
the common case. It matches HMM’s requirement that mappings be at
least readable and avoids futile retry cycles. This is precisely the
kind of small, correctness-oriented fix that minimizes regression risk
and improves robustness for users of amdkfd/HMM.
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 3d8b20828c068..cecdbcea0bb90 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1714,6 +1714,29 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
next = min(vma->vm_end, end);
npages = (next - addr) >> PAGE_SHIFT;
+ /* HMM requires at least READ permissions. If provided with PROT_NONE,
+ * unmap the memory. If it's not already mapped, this is a no-op
+ * If PROT_WRITE is provided without READ, warn first then unmap
+ */
+ if (!(vma->vm_flags & VM_READ)) {
+ unsigned long e, s;
+
+ svm_range_lock(prange);
+ if (vma->vm_flags & VM_WRITE)
+ pr_debug("VM_WRITE without VM_READ is not supported");
+ s = max(start, prange->start);
+ e = min(end, prange->last);
+ if (e >= s)
+ r = svm_range_unmap_from_gpus(prange, s, e,
+ KFD_SVM_UNMAP_TRIGGER_UNMAP_FROM_CPU);
+ svm_range_unlock(prange);
+ /* If unmap returns non-zero, we'll bail on the next for loop
+ * iteration, so just leave r and continue
+ */
+ addr = next;
+ continue;
+ }
+
WRITE_ONCE(p->svms.faulting_task, current);
r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
readonly, owner, NULL,
--
2.51.0
next prev parent reply other threads:[~2025-10-25 16:21 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20251025160905.3857885-1-sashal@kernel.org>
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17] amd/amdkfd: enhance kfd process check in switch partition Sasha Levin
2025-10-25 15:54 ` [PATCH AUTOSEL 6.17-6.12] drm/amdgpu: fix nullptr err of vm_handle_moved Sasha Levin
2025-10-25 15:55 ` [PATCH AUTOSEL 6.17-6.1] drm/amdgpu: Allow kfd CRIU with no buffer objects Sasha Levin
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17] drm/amd/pm: refine amdgpu pm sysfs node error code Sasha Levin
2025-10-25 15:58 ` Sasha Levin [this message]
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17-6.6] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw Sasha Levin
2025-10-25 16:00 ` [PATCH AUTOSEL 6.17-5.4] drm/amdkfd: return -ENOTTY for unsupported IOCTLs Sasha Levin
2025-10-25 16:00 ` [PATCH AUTOSEL 6.17-6.1] drm/amdkfd: fix vram allocation failure for a special case Sasha Levin
2025-10-25 16:00 ` [PATCH AUTOSEL 6.17-5.4] drm/amdkfd: Tie UNMAP_LATENCY to queue_preemption Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251025160905.3857885-269-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=felix.kuehling@amd.com \
--cc=kent.russell@amd.com \
--cc=patches@lists.linux.dev \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox