Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151)

* Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151)
@ 2026-06-12 13:12 Jonathan L.
  2026-06-15 12:01 ` Timur Kristóf
  0 siblings, 1 reply; 3+ messages in thread
From: Jonathan L. @ 2026-06-12 13:12 UTC
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Harish.Kasiviswanathan,
	timur.kristof

Hi team,

I am reporting a regression in the AMDGPU driver affecting the Strix Halo
APU (Radeon 8060S, gfx1151). While everything works correctly on kernel
7.1.0-rc5, upgrading to 7.1.0-rc7 causes GPU memory operations to hang
indefinitely. This occurs during tasks like torch.empty() or model weight
transfers in ComfyUI (PyTorch 2.11.0+rocm7.13).

I have bisected the changes in drivers/gpu/drm/amd/ between rc5 and rc7 and
identified the following potential causes:

1.  amdgpu_hmm.c (Christian König):

  - 1c824497d: Changing the invalidate callback to wait on the VM root BO
reservation lock may be introducing a deadlock.
  - 962d684b5: Moving the notifier_seq read outside the retry loop could
cause infinite retries with a stale sequence number.
  - 58bafc666: Changes to userptr submission waiting.

2.  gfxhub_v12_0.c (Timur Kristóf):

  - 40bab7c60: The change to CRASH_ON_*_FAULT bits might be causing the GPU
to retry failed memory accesses indefinitely rather than surfacing a fault.

3.  gmc_v12_0.c (Harish Kasiviswanathan):

  - ae4e30f24 and e3fa02872: If the new per-version PTE address masks for
gfx1151 are incorrect, it could result in corrupted page table entries.

4.  amdgpu_gart.c (Donet Tom):

  - ec4c462e2: The updated PTE iteration grouping may be producing
incorrect page tables when combined with the new PTE mask.

Downgrading to 7.1.0-rc5 resolves the issue. Please let me know if you
require any specific debug output or further testing.

Best regards,
Jonathan

* Re: Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151)
  2026-06-12 13:12 Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151) Jonathan L.
@ 2026-06-15 12:01 ` Timur Kristóf
  2026-06-15 12:39   ` Christian König
  0 siblings, 1 reply; 3+ messages in thread
From: Timur Kristóf @ 2026-06-15 12:01 UTC
  To: amd-gfx, Jonathan L.
  Cc: Alexander.Deucher, christian.koenig, Harish.Kasiviswanathan

On Friday, June 12, 2026 3:12:29 PM Central European Summer Time Jonathan L. 
wrote:
> Hi team,
> 
> I am reporting a regression in the AMDGPU driver affecting the Strix Halo
> APU (Radeon 8060S, gfx1151). While everything works correctly on kernel
> 7.1.0-rc5, upgrading to 7.1.0-rc7 causes GPU memory operations to hang
> indefinitely. This occurs during tasks like torch.empty() or model weight
> transfers in ComfyUI (PyTorch 2.11.0+rocm7.13).
> 
> I have bisected the changes in drivers/gpu/drm/amd/ between rc5 and rc7 and
> identified the following potential causes:

Hi Jonathan,

Can you please bisect which of those four patches causes your issue?

Thanks,
Timur
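
For the bisect requested above, a small reproducer with a timeout makes each step easier to judge. The following is a minimal sketch, not something posted in this thread: the `REPRO` snippet, the `classify` and `main` helpers, and the 60-second timeout are all assumptions chosen for illustration. It assumes a ROCm build of PyTorch, which exposes the GPU under the "cuda" device name. The return values follow the `git bisect run` exit-code convention (0 good, 1 bad, 125 skip), so the result maps directly onto `git bisect good`, `bad`, or `skip` after booting each candidate kernel.

```python
import subprocess
import sys

# Exit codes understood by `git bisect run`:
# 0 marks a commit good, 1..127 (except 125) marks it bad, 125 skips it.
GOOD, BAD, SKIP = 0, 1, 125

# Hypothetical reproducer: allocate a GPU tensor and wait for the device.
# Assumes a ROCm build of PyTorch, which exposes the GPU as "cuda".
REPRO = (
    "import torch\n"
    "x = torch.empty(1 << 20, device='cuda')\n"
    "torch.cuda.synchronize()\n"
)


def classify(cmd, timeout_s):
    """Run cmd in a child process and map the outcome to a bisect exit code."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # No completion within the timeout: treated as the reported hang.
        return BAD
    # A non-zero exit is some other failure (for example, torch missing),
    # so the commit is skipped rather than blamed.
    return GOOD if result.returncode == 0 else SKIP


def main():
    """Run the reproducer with an assumed 60-second timeout."""
    return classify([sys.executable, "-c", REPRO], timeout_s=60)

# When saved as a script, finish with: sys.exit(main())
```

One caveat: `subprocess.run` kills the child on timeout and then waits for it to exit. If the hung process is stuck in an uninterruptible kernel wait, that final wait may itself block, in which case the hang is still evident but the script will not return on its own.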

> 
> 1.  amdgpu_hmm.c (Christian König):
> 
>   - 1c824497d: Changing the invalidate callback to wait on the VM root BO
> reservation lock may be introducing a deadlock.
>   - 962d684b5: Moving the notifier_seq read outside the retry loop could
> cause infinite retries with a stale sequence number.
>   - 58bafc666: Changes to userptr submission waiting.
> 
> 2.  gfxhub_v12_0.c (Timur Kristóf):
> 
>   - 40bab7c60: The change to CRASH_ON_*_FAULT bits might be causing the GPU
> to retry failed memory accesses indefinitely rather than surfacing a fault.

Your Strix Halo chip has a GFX11.5 core, which uses gfxhub_v11_5.c.
Changes to gfxhub_v12_0.c will not affect your chip.

Note that retry faults are not enabled on Strix Halo by default, and don't 
behave the way you described.

> 
> 3.  gmc_v12_0.c (Harish Kasiviswanathan):
> 
>   - ae4e30f24 and e3fa02872: If the new per-version PTE address masks for
> gfx1151 are incorrect, it could result in corrupted page table entries.
> 
> 4.  amdgpu_gart.c (Donet Tom):
> 
>   - ec4c462e2: The updated PTE iteration grouping may be producing
> incorrect page tables when combined with the new PTE mask.
> 
> Downgrading to 7.1.0-rc5 resolves the issue. Please let me know if you
> require any specific debug output or further testing.
> 
> Best regards,
> Jonathan




* Re: Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151)
  2026-06-15 12:01 ` Timur Kristóf
@ 2026-06-15 12:39   ` Christian König
  0 siblings, 0 replies; 3+ messages in thread
From: Christian König @ 2026-06-15 12:39 UTC
  To: Timur Kristóf, amd-gfx, Jonathan L.
  Cc: Alexander.Deucher, Harish.Kasiviswanathan



On 6/15/26 14:01, Timur Kristóf wrote:
> On Friday, June 12, 2026 3:12:29 PM Central European Summer Time Jonathan L. 
> wrote:
>> Hi team,
>>
>> I am reporting a regression in the AMDGPU driver affecting the Strix Halo
>> APU (Radeon 8060S, gfx1151). While everything works correctly on kernel
>> 7.1.0-rc5, upgrading to 7.1.0-rc7 causes GPU memory operations to hang
>> indefinitely. This occurs during tasks like torch.empty() or model weight
>> transfers in ComfyUI (PyTorch 2.11.0+rocm7.13).
>>
>> I have bisected the changes in drivers/gpu/drm/amd/ between rc5 and rc7 and
>> identified the following potential causes:
> 
> Hi Jonathan,
> 
> Can you please bisect which of those four patches causes your issue?
> 
> Thanks,
> Timur
> 
>>
>> 1.  amdgpu_hmm.c (Christian König):
>>
>>   - 1c824497d: Changing the invalidate callback to wait on the VM root BO
>> reservation lock may be introducing a deadlock.

No, that was done before anyway. Just with a different BO.

>>   - 962d684b5: Moving the notifier_seq read outside the retry loop could
>> cause infinite retries with a stale sequence number.

That was indeed an issue but should be fixed on amd-staging-drm-next. Can you re-test with that branch?

Thanks,
Christian.
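
The stale-sequence failure mode discussed above can be illustrated with a toy model. This is a sketch under stated assumptions, not the amdgpu code: the `Notifier` class and `attempts_needed` function are hypothetical, and their method names are only loosely modeled on the kernel's `mmu_interval_read_begin()` and `mmu_interval_read_retry()`. It shows why a sequence number sampled once, outside the retry loop, can never pass the retry check after a single invalidation.

```python
class Notifier:
    """Toy stand-in for an interval notifier with an invalidation counter."""

    def __init__(self):
        self.seq = 0

    def invalidate(self):
        # Each invalidation advances the sequence number.
        self.seq += 1

    def read_begin(self):
        # Sample the current sequence number.
        return self.seq

    def read_retry(self, seq):
        # True means an invalidation happened since seq was sampled.
        return seq != self.seq


def attempts_needed(notifier, sample_inside_loop, max_attempts=5):
    """Return the attempt number that succeeds, or None if none does."""
    seq = notifier.read_begin()
    for attempt in range(1, max_attempts + 1):
        if sample_inside_loop:
            # Re-sample on every pass so the check sees a fresh value.
            seq = notifier.read_begin()
        if attempt == 1:
            # Simulate one concurrent invalidation, on the first pass only.
            notifier.invalidate()
        if not notifier.read_retry(seq):
            return attempt
    return None
```

With `sample_inside_loop=True` the second attempt succeeds. With `sample_inside_loop=False` the sampled value stays behind the counter, so every attempt is rejected; the loop is bounded by `max_attempts` here only so the sketch terminates.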


>>   - 58bafc666: Changes to userptr submission waiting.
>>
>> 2.  gfxhub_v12_0.c (Timur Kristóf):
>>
>>   - 40bab7c60: The change to CRASH_ON_*_FAULT bits might be causing the GPU
>> to retry failed memory accesses indefinitely rather than surfacing a fault.
> 
> Your Strix Halo chip has a GFX11.5 core which uses gfxhub_v11_5.c
> Changes to gfxhub_v12_0.c will not affect your chip.
> 
> Note that retry faults are not enabled on Strix Halo by default, and don't 
> behave the way you described.
> 
>>
>> 3.  gmc_v12_0.c (Harish Kasiviswanathan):
>>
>>   - ae4e30f24 and e3fa02872: If the new per-version PTE address masks for
>> gfx1151 are incorrect, it could result in corrupted page table entries.
>>
>> 4.  amdgpu_gart.c (Donet Tom):
>>
>>   - ec4c462e2: The updated PTE iteration grouping may be producing
>> incorrect page tables when combined with the new PTE mask.
>>
>> Downgrading to 7.1.0-rc5 resolves the issue. Please let me know if you
>> require any specific debug output or further testing.
>>
>> Best regards,
>> Jonathan
> 
> 
> 

