* Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151)
From: Jonathan L. @ 2026-06-12 13:12 UTC (permalink / raw)
To: amd-gfx
Cc: Alexander.Deucher, christian.koenig, Harish.Kasiviswanathan,
timur.kristof
Hi team,
I am reporting a regression in the AMDGPU driver affecting the Strix Halo
APU (Radeon 8060S, gfx1151). While everything works correctly on kernel
7.1.0-rc5, upgrading to 7.1.0-rc7 causes GPU memory operations to hang
indefinitely. This occurs during tasks like torch.empty() or model weight
transfers in ComfyUI (PyTorch 2.11.0+rocm7.13).
I have bisected the changes in drivers/gpu/drm/amd/ between rc5 and rc7 and
identified the following potential causes:
1. amdgpu_hmm.c (Christian König):
- 1c824497d: Changing the invalidate callback to wait on the VM root BO
reservation lock may be introducing a deadlock.
- 962d684b5: Moving the notifier_seq read outside the retry loop could
cause infinite retries with a stale sequence number.
- 58bafc666: Changes to userptr submission waiting.
2. gfxhub_v12_0.c (Timur Kristóf):
- 40bab7c60: The change to CRASH_ON_*_FAULT bits might be causing the GPU
to retry failed memory accesses indefinitely rather than surfacing a fault.
3. gmc_v12_0.c (Harish Kasiviswanathan):
- ae4e30f24 and e3fa02872: If the new per-version PTE address masks for
gfx1151 are incorrect, it could result in corrupted page table entries.
4. amdgpu_gart.c (Donet Tom):
- ec4c462e2: The updated PTE iteration grouping may be producing
incorrect page tables when combined with the new PTE mask.
Downgrading to 7.1.0-rc5 resolves the issue. Please let me know if you
require any specific debug output or further testing.
Best regards,
Jonathan
* Re: Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151)
From: Timur Kristóf @ 2026-06-15 12:01 UTC (permalink / raw)
To: amd-gfx, Jonathan L.
Cc: Alexander.Deucher, christian.koenig, Harish.Kasiviswanathan
On Friday, June 12, 2026 3:12:29 PM Central European Summer Time, Jonathan L. wrote:
> Hi team,
>
> I am reporting a regression in the AMDGPU driver affecting the Strix Halo
> APU (Radeon 8060S, gfx1151). While everything works correctly on kernel
> 7.1.0-rc5, upgrading to 7.1.0-rc7 causes GPU memory operations to hang
> indefinitely. This occurs during tasks like torch.empty() or model weight
> transfers in ComfyUI (PyTorch 2.11.0+rocm7.13).
>
> I have bisected the changes in drivers/gpu/drm/amd/ between rc5 and rc7 and
> identified the following potential causes:
Hi Jonathan,
Can you please bisect which of those four patches causes your issue?
Thanks,
Timur
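For the bisect requested above, a small pass/fail harness can make each step less error-prone. The following is only a sketch: the reproducer command, the 60-second timeout and the helper name classify() are assumptions, not something from the report. The exit codes follow the `git bisect run` convention (0 = good, 1 = bad, 125 = cannot tell). Because every kernel under test has to be booted first, the script would be run by hand after each reboot and its verdict passed to `git bisect good`, `git bisect bad` or `git bisect skip`.

```python
import subprocess
import sys

# Exit codes in the convention used by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125

# Assumed reproducer: allocate a small tensor on the GPU and synchronize.
# ROCm builds of PyTorch expose the GPU under the "cuda" device name.
REPRO = [
    sys.executable,
    "-c",
    "import torch; torch.empty(1024, device='cuda'); torch.cuda.synchronize()",
]


def classify(cmd, timeout_s):
    """Run cmd once and map the outcome to a bisect verdict.

    A timeout is treated as the reported hang (BAD). Any other failure
    (missing binary, non-zero exit) is treated as SKIP, because it may be
    unrelated to the regression being bisected.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return BAD
    except OSError:
        return SKIP
    return GOOD if result.returncode == 0 else SKIP
```

To use it on a freshly booted kernel, call `sys.exit(classify(REPRO, timeout_s=60))` at the end of the script. Note that if the hung process is stuck inside the kernel, killing it after the timeout may itself block, so a reboot may still be needed before the next step.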
>
> 1. amdgpu_hmm.c (Christian König):
>
> - 1c824497d: Changing the invalidate callback to wait on the VM root BO
> reservation lock may be introducing a deadlock.
> - 962d684b5: Moving the notifier_seq read outside the retry loop could
> cause infinite retries with a stale sequence number.
> - 58bafc666: Changes to userptr submission waiting.
>
> 2. gfxhub_v12_0.c (Timur Kristóf):
>
> - 40bab7c60: The change to CRASH_ON_*_FAULT bits might be causing the GPU
> to retry failed memory accesses indefinitely rather than surfacing a fault.
Your Strix Halo chip has a GFX11.5 core, which uses gfxhub_v11_5.c.
Changes to gfxhub_v12_0.c will not affect your chip.
Note that retry faults are not enabled on Strix Halo by default, and don't
behave the way you described.
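As a quick sanity check on the point above, the target name itself encodes the graphics IP generation. The sketch below assumes the usual LLVM-style naming, in which the last character is the stepping (a hex digit), the one before it is the minor version, and the remaining digits are the major version. The helper name decode_gfx_target() is hypothetical. It says nothing about which driver files are built for a chip; it only shows that gfx1151 is an 11.5.x part rather than a 12.x one.

```python
def decode_gfx_target(name):
    """Split a 'gfxNNNN'-style target name into (major, minor, stepping)."""
    if not name.startswith("gfx") or len(name) < 6:
        raise ValueError(f"not a gfx target name: {name!r}")
    digits = name[3:]
    major = int(digits[:-2])         # leading digits, e.g. "11" or "9"
    minor = int(digits[-2])          # second-to-last character
    stepping = int(digits[-1], 16)   # last character, may be a hex letter
    return major, minor, stepping


major, minor, stepping = decode_gfx_target("gfx1151")
print(f"gfx1151 is GFX{major}.{minor} (stepping {stepping})")
```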
>
> 3. gmc_v12_0.c (Harish Kasiviswanathan):
>
> - ae4e30f24 and e3fa02872: If the new per-version PTE address masks for
> gfx1151 are incorrect, it could result in corrupted page table entries.
>
> 4. amdgpu_gart.c (Donet Tom):
>
> - ec4c462e2: The updated PTE iteration grouping may be producing
> incorrect page tables when combined with the new PTE mask.
>
> Downgrading to 7.1.0-rc5 resolves the issue. Please let me know if you
> require any specific debug output or further testing.
>
> Best regards,
> Jonathan
* Re: Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151)
From: Christian König @ 2026-06-15 12:39 UTC (permalink / raw)
To: Timur Kristóf, amd-gfx, Jonathan L.
Cc: Alexander.Deucher, Harish.Kasiviswanathan
On 6/15/26 14:01, Timur Kristóf wrote:
> On Friday, June 12, 2026 3:12:29 PM Central European Summer Time, Jonathan L. wrote:
>> Hi team,
>>
>> I am reporting a regression in the AMDGPU driver affecting the Strix Halo
>> APU (Radeon 8060S, gfx1151). While everything works correctly on kernel
>> 7.1.0-rc5, upgrading to 7.1.0-rc7 causes GPU memory operations to hang
>> indefinitely. This occurs during tasks like torch.empty() or model weight
>> transfers in ComfyUI (PyTorch 2.11.0+rocm7.13).
>>
>> I have bisected the changes in drivers/gpu/drm/amd/ between rc5 and rc7 and
>> identified the following potential causes:
>
> Hi Jonathan,
>
> Can you please bisect which of those four patches causes your issue?
>
> Thanks,
> Timur
>
>>
>> 1. amdgpu_hmm.c (Christian König):
>>
>> - 1c824497d: Changing the invalidate callback to wait on the VM root BO
>> reservation lock may be introducing a deadlock.
No, that was done before anyway. Just with a different BO.
>> - 962d684b5: Moving the notifier_seq read outside the retry loop could
>> cause infinite retries with a stale sequence number.
That was indeed an issue but should be fixed on amd-staging-drm-next. Can you re-test with that branch?
Thanks,
Christian.
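The stale-sequence problem acknowledged above can be illustrated with a toy model. This is not the amdgpu code and not the fix on amd-staging-drm-next; it is a minimal simulation, loosely modelled on the begin/retry pattern of sequence-counted notifiers, with every class and function name invented for the illustration. The loop is capped at a few attempts so that the "infinite retry" case terminates and can be observed.

```python
class Notifier:
    """Toy stand-in for a notifier that carries a sequence counter."""

    def __init__(self):
        self.seq = 0

    def invalidate(self):
        # A concurrent invalidation bumps the sequence number.
        self.seq += 1

    def read_begin(self):
        return self.seq

    def read_retry(self, seq):
        # True means "an invalidation happened since seq was sampled".
        return self.seq != seq


def fresh_read(notifier, work, max_attempts=5):
    """Sample the sequence at the start of every attempt."""
    for attempt in range(1, max_attempts + 1):
        seq = notifier.read_begin()
        work(attempt)
        if not notifier.read_retry(seq):
            return attempt  # succeeded on this attempt
    return None


def stale_read(notifier, work, max_attempts=5):
    """Sample the sequence once, outside the loop, and never refresh it."""
    seq = notifier.read_begin()
    for attempt in range(1, max_attempts + 1):
        work(attempt)
        if not notifier.read_retry(seq):
            return attempt
    return None  # gave up: every retry compared against the stale value


def one_shot_invalidation(notifier):
    """Build a work function that races with one invalidation on attempt 1."""
    def work(attempt):
        if attempt == 1:
            notifier.invalidate()
    return work


n = Notifier()
print("fresh:", fresh_read(n, one_shot_invalidation(n)))
n = Notifier()
print("stale:", stale_read(n, one_shot_invalidation(n)))
```

With a single invalidation racing against the first attempt, the fresh variant succeeds on the second attempt, while the stale variant never sees a matching sequence number and exhausts its attempts. Without the cap it would spin forever, which matches the symptom described in the report.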
>> - 58bafc666: Changes to userptr submission waiting.
>>
>> 2. gfxhub_v12_0.c (Timur Kristóf):
>>
>> - 40bab7c60: The change to CRASH_ON_*_FAULT bits might be causing the GPU
>> to retry failed memory accesses indefinitely rather than surfacing a fault.
>
> Your Strix Halo chip has a GFX11.5 core, which uses gfxhub_v11_5.c.
> Changes to gfxhub_v12_0.c will not affect your chip.
>
> Note that retry faults are not enabled on Strix Halo by default, and don't
> behave the way you described.
>
>>
>> 3. gmc_v12_0.c (Harish Kasiviswanathan):
>>
>> - ae4e30f24 and e3fa02872: If the new per-version PTE address masks for
>> gfx1151 are incorrect, it could result in corrupted page table entries.
>>
>> 4. amdgpu_gart.c (Donet Tom):
>>
>> - ec4c462e2: The updated PTE iteration grouping may be producing
>> incorrect page tables when combined with the new PTE mask.
>>
>> Downgrading to 7.1.0-rc5 resolves the issue. Please let me know if you
>> require any specific debug output or further testing.
>>
>> Best regards,
>> Jonathan