* Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151)
@ 2026-06-12 13:12 Jonathan L.
From: Jonathan L. @ 2026-06-12 13:12 UTC
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Harish.Kasiviswanathan,
	timur.kristof

Hi team,

I am reporting a regression in the AMDGPU driver affecting the Strix Halo
APU (Radeon 8060S, gfx1151). While everything works correctly on kernel
7.1.0-rc5, upgrading to 7.1.0-rc7 causes GPU memory operations to hang
indefinitely. The hang shows up in calls such as torch.empty() and during
model weight transfers in ComfyUI (PyTorch 2.11.0+rocm7.13).
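
For anyone trying to reproduce this outside ComfyUI, a minimal sketch of a
hang check is below. It runs the allocation in a child process and treats a
timeout as a hang. The tensor size, the 60 second cutoff, and the helper name
run_with_timeout are my own assumptions, not part of any existing tool. Note
that if the child ends up in uninterruptible sleep inside the driver, the
parent may block as well when it tries to reap it.

```python
import subprocess
import sys

# Child script: allocate an uninitialized tensor on the GPU and synchronize.
# ROCm builds of PyTorch expose the GPU under the "cuda" device name.
# The 1024x1024 size is an arbitrary choice for illustration.
CHILD_CODE = """
import torch
x = torch.empty((1024, 1024), device="cuda")
torch.cuda.synchronize()
print("allocated", tuple(x.shape))
"""


def run_with_timeout(argv, timeout_s):
    """Run argv in a child process and classify it as 'ok', 'error' or 'hang'.

    Hypothetical helper: 'hang' only means the child did not finish within
    timeout_s seconds, so a very slow but working system would also match.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if proc.returncode == 0 else "error"


if __name__ == "__main__":
    # 60 seconds is an assumed cutoff; the allocation normally returns
    # almost immediately on a working kernel.
    print(run_with_timeout([sys.executable, "-c", CHILD_CODE], timeout_s=60))
```

Running the same script on both kernels should make the difference easy to
show in a log without involving the full ComfyUI stack.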

I have bisected the changes in drivers/gpu/drm/amd/ between rc5 and rc7 and
identified the following potential causes:

1.  amdgpu_hmm.c (Christian König):

  - 1c824497d: Changing the invalidate callback to wait on the VM root BO
    reservation lock may be introducing a deadlock.
  - 962d684b5: Moving the notifier_seq read outside the retry loop could
    cause infinite retries with a stale sequence number.
  - 58bafc666: Changes to userptr submission waiting.

2.  gfxhub_v12_0.c (Timur Kristóf):

  - 40bab7c60: The change to CRASH_ON_*_FAULT bits might be causing the GPU
    to retry failed memory accesses indefinitely rather than surfacing a fault.

3.  gmc_v12_0.c (Harish Kasiviswanathan):

  - ae4e30f24 and e3fa02872: If the new per-version PTE address masks for
    gfx1151 are incorrect, they could result in corrupted page table entries.

4.  amdgpu_gart.c (Donet Tom):

  - ec4c462e2: The updated PTE iteration grouping may be producing
    incorrect page tables when combined with the new PTE mask.
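
To pin the hang on a single commit from the list above, a path-limited git
bisect between the two release candidates is one option. The sketch below
only builds the command lines; the tag names v7.1-rc5 and v7.1-rc7 and the
helper names are assumptions on my side, so adjust them to match your tree.

```python
def log_cmd(good, bad, path):
    """Build 'git log --oneline <good>..<bad> -- <path>' as an argv list."""
    return ["git", "log", "--oneline", f"{good}..{bad}", "--", path]


def bisect_start_cmd(good, bad, path):
    """Build 'git bisect start <bad> <good> -- <path>' as an argv list.

    git expects the bad revision first, followed by the good one; the path
    after '--' limits the bisect to commits touching that directory.
    """
    return ["git", "bisect", "start", bad, good, "--", path]


if __name__ == "__main__":
    GOOD, BAD = "v7.1-rc5", "v7.1-rc7"  # assumed tag names
    PATH = "drivers/gpu/drm/amd/"
    # Print the commands instead of running them, so nothing in the
    # working tree is touched by accident.
    print(" ".join(log_cmd(GOOD, BAD, PATH)))
    print(" ".join(bisect_start_cmd(GOOD, BAD, PATH)))
```

Each bisect step would then be a kernel build, a reboot, and one attempt at
the failing allocation, followed by git bisect good or git bisect bad.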

Downgrading to 7.1.0-rc5 resolves the issue. Please let me know if you
require any specific debug output or further testing.

Best regards,
Jonathan


Thread overview: 3+ messages
2026-06-12 13:12 Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151) Jonathan L.
2026-06-15 12:01 ` Timur Kristóf
2026-06-15 12:39   ` Christian König
