Subject: gfx1150 / Radeon 890M – MES scheduler wedge under sustained compute (Ring 13 stalls, GPU unrecoverable; reset triggers power-off) on kernel-6.18.0

From: Harris Landgarten <harrisl@lhjonline.com>
To: amd-gfx@lists.freedesktop.org
Subject: gfx1150 / Radeon 890M – MES scheduler wedge under sustained compute (Ring 13 stalls, GPU unrecoverable; reset triggers power-off) on kernel-6.18.0
Date: Tue, 9 Dec 2025 20:56:35 -0500 (EST)
Message-ID: <1810124938.51173.1765331795979.JavaMail.zimbra@lhjonline.com>

Hello AMD GPU team,

I am reporting a reproducible issue on Strix Point (gfx1150, Radeon 890M)
where the MES scheduler (Ring 13) wedges under sustained compute load.
Once this occurs, the GPU becomes unrecoverable from userspace, and the
system becomes unable to shut down cleanly. This happens reliably during
extended machine-learning training workloads.

Hardware / software environment:
--------------------------------
• GPU: Strix [Radeon 880M / 890M], PCI ID 1002:150e  
• CPU: AMD Ryzen AI 9 HX 370  
• OS: Gentoo Linux (fully updated)  
• Kernel: 6.18.0-gentoo-x86_64  
• linux-firmware version: linux-firmware-20251125_p20251203  
• GPU IFWI: 113-STRIXEMU-001, version 00107777  

Important notes about firmware authenticity:
--------------------------------------------
• The GPU firmware in the linux-firmware package is **unmodified AMD-
  provided firmware**.  
• Gentoo does *not* patch amdgpu firmware—if it loaded successfully, it is
  the exact AMD-signed blob.  
• All GPU firmware components load with valid signatures.  
• There is a separate Gentoo CPU “amd-ucode” signature warning, but that is
  unrelated; GPU firmware loads cleanly.  
• I am therefore already running the *exact* firmware AMD intends for
  gfx1150. If a recent microcode or kernel fix was supposed to address
  MES issues, this confirms the problem persists.

Description of the failure:
---------------------------
After many hours of sustained GPU compute (deep-learning training),
the driver begins logging Ring 13 / MES errors. Once the first real-time
scheduler message failure appears, the GPU quickly becomes unresponsive:

• amdgpu stops accepting messages  
• Ring 13 reports "MES buffer full" and/or stops progressing  
• display session dies and triggers emergency logout  
• attempting to restart GDM usually causes an immediate reboot  
• the system cannot complete a clean shutdown once MES is wedged  

Critically, `amd-smi reset --gpureset -g 0` *almost* works: it resets GFX
momentarily and GNOME becomes responsive for ~30 seconds, but the GPU
forces a power-off shortly afterward. This suggests the MES block remains
in a corrupted state and cannot be reinitialized safely.

This behavior is consistent and fully reproducible under heavy compute.
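
For anyone trying to catch the first failure in a long log, the following
is a minimal sketch of a filter that picks MES-related lines out of kernel
log text. The only message wording taken from this report is "MES buffer
full"; the second pattern and the name `find_mes_errors` are illustrative
assumptions, not the driver's actual message format.

```python
import re

# "MES buffer full" is quoted in this report. The second pattern is a guess
# at related wording and may need adjusting to the real kernel messages.
MES_PATTERNS = [
    re.compile(r"MES buffer full", re.IGNORECASE),
    re.compile(r"\bMES\b.*\b(fail|timeout|error)", re.IGNORECASE),
]


def find_mes_errors(lines):
    """Return (index, line) pairs for log lines that look like MES failures."""
    hits = []
    for idx, line in enumerate(lines):
        if any(pattern.search(line) for pattern in MES_PATTERNS):
            hits.append((idx, line.rstrip("\n")))
    return hits
```

It accepts any iterable of lines, so saved `dmesg` output or a journal
slice can be passed in as an open file object.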

`amd-smi` diagnostic data:
--------------------------
The following `amd-smi` outputs were captured **while training was running**,
so they reflect runtime conditions under load rather than idle reporting.

Examples:

`amd-smi static -g all`:
  • correctly reports gfx1150, 16 CUs, 4096 MB VRAM, clocks, caches, etc.
  • IFWI package 113-STRIXEMU-001 v00107777 is present
  • firmware components (CP_PFP, CP_ME, MEC1, RLC, SDMA, VCN, ASD, PM)
    all load with valid versions

`amd-smi metric -g all`:
  • reports training-phase VRAM usage (approx. 2.4 GB of 4 GB)
  • temperatures are normal
  • most advanced metrics for this iGPU are "N/A"
  • system fully responsive before the wedge

`amd-smi list -g all`:
  • GPU 0 at BDF 0000:c5:00.0 with UUID 00ff150e-0000-1000-8000-000000000000
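
A capture like the one above could be repeated at intervals so that the
last snapshot before a wedge is preserved. The sketch below is one way to
do that, not the procedure used for this report. It runs only the three
`amd-smi` invocations named here; the function names, the 30-second
timeout, and the output layout are assumptions.

```python
import datetime
import subprocess

# The three invocations named in this report; no other amd-smi options are assumed.
COMMANDS = [
    ["amd-smi", "static", "-g", "all"],
    ["amd-smi", "metric", "-g", "all"],
    ["amd-smi", "list", "-g", "all"],
]


def run_command(argv):
    """Run one command and return its output; report failures as text."""
    try:
        # The timeout (an arbitrary choice) guards against a tool that hangs
        # once the GPU has stopped responding.
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=30)
        return proc.stdout + proc.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed: {exc}>\n"


def snapshot(runner=run_command, now=None):
    """Build one timestamped text snapshot covering all commands."""
    stamp = (now or datetime.datetime.now()).isoformat(timespec="seconds")
    parts = [f"=== snapshot {stamp} ==="]
    for argv in COMMANDS:
        parts.append("$ " + " ".join(argv))
        parts.append(runner(argv).rstrip("\n"))
    return "\n".join(parts) + "\n"
```

Appending each `snapshot()` result to a file and flushing it to disk after
every write would help the data survive the forced power-off described
earlier.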

Impact:
-------
Once Ring 13 wedges, the system cannot:
• process new compute work  
• process display work reliably  
• recover through an amdgpu reset  
• shut down cleanly  

This appears to be a MES firmware or driver scheduling issue specific to
gfx1150 under long-duration compute loads.

What I can provide:
-------------------
• Full logs including `dmesg`, journal slices, and fence dump snapshots  
• Exact reproduction steps (training runs 10–20 hours, always reproduces)  
• Willingness to test:
    – patched kernels
    – updated MES firmware
    – new linux-firmware bundles
    – debug patches or instrumentation  
• Any additional diagnostic commands you require  

Please let me know what additional information would be most useful.

Thank you for your time, and I’m happy to assist further with debugging
gfx1150 stability under sustained compute load.

https://bugs.gentoo.org/967078 references this issue with attachments

Harris Landgarten
516 643-1286
