From: "Timur Kristóf" <timur.kristof@gmail.com>
To: amd-gfx@lists.freedesktop.org, "Jonathan L." <jonaphin@gmail.com>
Cc: Alexander.Deucher@amd.com, christian.koenig@amd.com,
Harish.Kasiviswanathan@amd.com
Subject: Re: Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151)
Date: Mon, 15 Jun 2026 14:01:27 +0200
Message-ID: <4898269.vXUDI8C0e8@timur-hyperion>
In-Reply-To: <CAPs_=oyWXJuQ4VS7LNQgM=jGKPMDoj1OnagU9YNQY+7aEeabHw@mail.gmail.com>
On Friday, June 12, 2026 3:12:29 PM Central European Summer Time Jonathan L.
wrote:
> Hi team,
>
> I am reporting a regression in the AMDGPU driver affecting the Strix Halo
> APU (Radeon 8060S, gfx1151). While everything works correctly on kernel
> 7.1.0-rc5, upgrading to 7.1.0-rc7 causes GPU memory operations to hang
> indefinitely. This occurs during tasks like torch.empty() or model weight
> transfers in ComfyUI (PyTorch 2.11.0+rocm7.13).
>
> I have bisected the changes in drivers/gpu/drm/amd/ between rc5 and rc7 and
> identified the following potential causes:
Hi Jonathan,

Can you please bisect which of those four patches causes your issue?

Thanks,
Timur
>
> 1. amdgpu_hmm.c (Christian König):
>
> - 1c824497d: Changing the invalidate callback to wait on the VM root BO
> reservation lock may be introducing a deadlock.
> - 962d684b5: Moving the notifier_seq read outside the retry loop could
> cause infinite retries with a stale sequence number.
> - 58bafc666: Changes to userptr submission waiting.
>
> 2. gfxhub_v12_0.c (Timur Kristóf):
>
> - 40bab7c60: The change to CRASH_ON_*_FAULT bits might be causing the GPU
> to retry failed memory accesses indefinitely rather than surfacing a fault.
Your Strix Halo chip has a GFX11.5 core, which uses gfxhub_v11_5.c.
Changes to gfxhub_v12_0.c will not affect your chip.

Note that retry faults are not enabled on Strix Halo by default, and they
don't behave the way you described.
>
> 3. gmc_v12_0.c (Harish Kasiviswanathan):
>
> - ae4e30f24 and e3fa02872: If the new per-version PTE address masks for
> gfx1151 are incorrect, it could result in corrupted page table entries.
>
> 4. amdgpu_gart.c (Donet Tom):
>
> - ec4c462e2: The updated PTE iteration grouping may be producing
> incorrect page tables when combined with the new PTE mask.
>
> Downgrading to 7.1.0-rc5 resolves the issue. Please let me know if you
> require any specific debug output or further testing.
>
> Best regards,
> Jonathan
Thread overview (3+ messages):
2026-06-12 13:12 Linux 7.1-rc7 regression — ROCm GPU memory ops hang on Strix Halo (gfx1151) Jonathan L.
2026-06-15 12:01 ` Timur Kristóf [this message]
2026-06-15 12:39 ` Christian König