Re: Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151)

From: "Timur Kristóf" <timur.kristof@gmail.com>
To: amd-gfx@lists.freedesktop.org, "Jonathan L." <jonaphin@gmail.com>
Cc: Alexander.Deucher@amd.com, christian.koenig@amd.com,
	Harish.Kasiviswanathan@amd.com
Subject: Re: Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151)
Date: Mon, 15 Jun 2026 14:01:27 +0200
Message-ID: <4898269.vXUDI8C0e8@timur-hyperion>
In-Reply-To: <CAPs_=oyWXJuQ4VS7LNQgM=jGKPMDoj1OnagU9YNQY+7aEeabHw@mail.gmail.com>

On Friday, June 12, 2026 3:12:29 PM Central European Summer Time Jonathan L. 
wrote:
> Hi team,
> 
> I am reporting a regression in the AMDGPU driver affecting the Strix Halo
> APU (Radeon 8060S, gfx1151). While everything works correctly on kernel
> 7.1.0-rc5, upgrading to 7.1.0-rc7 causes GPU memory operations to hang
> indefinitely. This occurs during tasks like torch.empty() or model weight
> transfers in ComfyUI (PyTorch 2.11.0+rocm7.13).
> 
> I have bisected the changes in drivers/gpu/drm/amd/ between rc5 and rc7 and
> identified the following potential causes:

Hi Jonathan,

Can you please bisect which of those four patches causes your issue?

Thanks,
Timur

> 
> 1.  amdgpu_hmm.c (Christian König):
> 
>   - 1c824497d: Changing the invalidate callback to wait on the VM root BO
> reservation lock may be introducing a deadlock.
>   - 962d684b5: Moving the notifier_seq read outside the retry loop could
> cause infinite retries with a stale sequence number.
>   - 58bafc666: Changes to userptr submission waiting.
> 
> 2.  gfxhub_v12_0.c (Timur Kristóf):
> 
>   - 40bab7c60: The change to CRASH_ON_*_FAULT bits might be causing the GPU
> to retry failed memory accesses indefinitely rather than surfacing a fault.

Your Strix Halo chip has a GFX11.5 core, which uses gfxhub_v11_5.c.
Changes to gfxhub_v12_0.c will not affect your chip.

Note that retry faults are not enabled on Strix Halo by default, and don't 
behave the way you described.

> 
> 3.  gmc_v12_0.c (Harish Kasiviswanathan):
> 
>   - ae4e30f24 and e3fa02872: If the new per-version PTE address masks for
> gfx1151 are incorrect, it could result in corrupted page table entries.
> 
> 4.  amdgpu_gart.c (Donet Tom):
> 
>   - ec4c462e2: The updated PTE iteration grouping may be producing
> incorrect page tables when combined with the new PTE mask.
> 
> Downgrading to 7.1.0-rc5 resolves the issue. Please let me know if you
> require any specific debug output or further testing.
> 
> Best regards,
> Jonathan
