From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 869D9D3B995 for ; Wed, 10 Dec 2025 02:02:41 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id F3DF810E5FC; Wed, 10 Dec 2025 02:02:40 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; secure) header.d=lhjonline.com header.i=@lhjonline.com header.b="TXxr+qKl"; dkim-atps=neutral X-Greylist: delayed 361 seconds by postgrey-1.36 at gabe; Wed, 10 Dec 2025 02:02:39 UTC Received: from zmailpro.com (zmailpro.com [174.129.240.60]) by gabe.freedesktop.org (Postfix) with ESMTPS id 4F69910E5FC for ; Wed, 10 Dec 2025 02:02:39 +0000 (UTC) Received: from localhost (localhost.localdomain [127.0.0.1]) by zmailpro.com (Postfix) with ESMTP id D819689E321 for ; Tue, 9 Dec 2025 20:56:37 -0500 (EST) Received: from zmailpro.com ([127.0.0.1]) by localhost (zmailpro.com [127.0.0.1]) (amavis, port 10032) with ESMTP id yEZoyOhg_Ydz for ; Tue, 9 Dec 2025 20:56:36 -0500 (EST) Received: from localhost (localhost.localdomain [127.0.0.1]) by zmailpro.com (Postfix) with ESMTP id A9AD189E341 for ; Tue, 9 Dec 2025 20:56:36 -0500 (EST) DKIM-Filter: OpenDKIM Filter v2.10.3 zmailpro.com A9AD189E341 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=lhjonline.com; s=22187CEA-9A4A-11E6-98C3-D49722DE3479; t=1765331796; bh=5nv43zdoGxwMNPoqG6Busm13oZu3lyu4cByjaTf9gx8=; h=Date:From:To:Message-ID:MIME-Version; b=TXxr+qKl/QJJd/L0GfRzAHMti/bZzBN1gTLQ2dfKagQba26hj7/UFp4hg0GbIVa7A MPR4EejWpZiEn9uFZXJSHzv7D+kP7e91wwqDvp893cgKgRMOBhMNXTKalJecfHbiMj paOlb2ix4T3YYItyMbACNjLFPMdS0LGBoW2hG8VVGObeN923dOGQuKGnq1Wq3zBu6H d4y9ks86G+nd2XA30pWrr2lnsmmrGhNhf1Mphv1a7G0+qmSAt7sOQ/i5/3TUZIiMzD pG+p72d0D9dp0oUmy2SwTBc/VLVqSVjRXBl3j8Ryrwc9lEz5ska2BhihLDqzIruLss 2J00+j3sY1MfA== X-Virus-Scanned: amavis at zmailpro.com Received: from zmailpro.com ([127.0.0.1]) by localhost (zmailpro.com [127.0.0.1]) (amavis, port 10026) with ESMTP id e6ygw5cLSoNL for ; Tue, 9 Dec 2025 20:56:36 -0500 (EST) Received: from zmailpro.com (zmailpro.com [174.129.240.60]) by zmailpro.com (Postfix) with ESMTP id 8670B89E31B for ; Tue, 9 Dec 2025 20:56:36 -0500 (EST) Date: Tue, 9 Dec 2025 20:56:35 -0500 (EST) From: Harris Landgarten To: amd-gfx@lists.freedesktop.org Message-ID: <1810124938.51173.1765331795979.JavaMail.zimbra@lhjonline.com> Subject: =?utf-8?Q?Subject:_gfx1150_/_Radeon_890M_=E2=80=93_MES_s?= =?utf-8?Q?cheduler_wedge_under_sustained_compute_?= =?utf-8?Q?(Ring_13_stalls,_GPU_unrecoverable;_res?= =?utf-8?Q?et_triggers_power-off)_on_kernel-6.18.0?= MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Originating-IP: [174.129.240.60] X-Mailer: Zimbra 9.0.0_ZEXTRAS_20240927 (ZimbraWebClient - GC144 (Linux)/9.0.0_ZEXTRAS_20240927) Thread-Index: SKMl78hy58OFLA/DdFr9xVtEC4vWDw== Thread-Topic: Subject: gfx1150 / Radeon 890M =?utf-8?B?4oCTIE1FUw==?= scheduler wedge under sustained compute (Ring 13 stalls, GPU unrecoverable; reset triggers power-off) on kernel-6.18.0 X-BeenThere: amd-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion list for AMD gfx List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: amd-gfx-bounces@lists.freedesktop.org Sender: "amd-gfx" Hello AMD GPU team, I am reporting a reproducible issue on Strix Point (gfx1150, Radeon 890M) where the MES scheduler (Ring 13) wedges under sustained compute load. Once this occurs, the GPU becomes unrecoverable from userspace, and the system becomes unable to shut down cleanly. This happens reliably during extended machine-learning training workloads. Hardware / software environment: -------------------------------- =E2=80=A2 GPU: Strix [Radeon 880M / 890M], PCI ID 1002:150e =20 =E2=80=A2 CPU: AMD Ryzen AI 9 HX 370 =20 =E2=80=A2 OS: Gentoo Linux (fully updated) =20 =E2=80=A2 Kernel: 6.18.0-gentoo-x86_64 =20 =E2=80=A2 linux-firmware version: linux-firmware-20251125_p20251203 =20 =E2=80=A2 GPU IFWI: 113-STRIXEMU-001, version 00107777 =20 Important notes about firmware authenticity: -------------------------------------------- =E2=80=A2 The GPU firmware in the linux-firmware package is **unmodified AM= D- provided firmware**. =20 =E2=80=A2 Gentoo does *not* patch amdgpu firmware=E2=80=94if it loaded succ= essfully, it is the exact AMD-signed blob. =20 =E2=80=A2 All GPU firmware components load with valid signatures. =20 =E2=80=A2 There is a separate Gentoo CPU =E2=80=9Camd-ucode=E2=80=9D signat= ure warning, but that is unrelated; GPU firmware loads cleanly. =20 =E2=80=A2 I am therefore already running the *exact* firmware AMD intends f= or gfx1150. If a recent microcode or kernel fix was supposed to address MES issues, this confirms the problem persists. Description of the failure: --------------------------- After many hours of sustained GPU compute (deep-learning training), The driver begins logging Ring 13 / MES errors. Once the first real-time Scheduler message failure appears, the GPU quickly becomes unresponsive: =E2=80=A2 amdgpu stops accepting messages =20 =E2=80=A2 Ring 13 reports "MES buffer full" and/or stops progressing =20 =E2=80=A2 display session dies and triggers emergency logout =20 =E2=80=A2 attempting to restart GDM usually causes an immediate reboot =20 =E2=80=A2 The system cannot complete a clean shutdown once MES is wedged = =20 Critically, `amd-smi reset --gpureset -g 0` *almost* works=E2=80=94it reset= s GFX momentarily, GNOME becomes responsive for ~30 seconds, but the GPU then forces a power-off shortly afterward. This suggests the MES block remains in a corrupted state and cannot be reinitialized safely. This behavior is consistent and fully reproducible under heavy compute. `amd-smi` diagnostic data: -------------------------- The following `amd-smi` outputs were captured **while training was running*= * (i.e., during a real workload, not idle). These represent true runtime conditions rather than idle reporting. Examples: `amd-smi static -g all`: =E2=80=A2 correctly reports gfx1150, 16 CUs, 4096 MB VRAM, clocks, caches= , etc. =E2=80=A2 IFWI package 113-STRIXEMU-001 v00107777 is present =E2=80=A2 firmware components (CP_PFP, CP_ME, MEC1, RLC, SDMA, VCN, ASD, = PM) all load with valid versions `amd-smi metric -g all`: =E2=80=A2 reports training-phase VRAM usage (approx. 2.4 GB of 4 GB) =E2=80=A2 temperatures are normal =E2=80=A2 most advanced metrics for this iGPU are "N/A" =E2=80=A2 system fully responsive before the wedge `amd-smi list -g all`: =E2=80=A2 GPU 0 at BDF 0000:c5:00.0 with UUID 00ff150e-0000-1000-8000-000= 000000000 Impact: ------- Once Ring 13 wedges, the GPU cannot: =E2=80=A2 process new compute work =20 =E2=80=A2 process display work reliably =20 =E2=80=A2 accept amdgpu reset =20 =E2=80=A2 shut the system down cleanly =20 This appears to be a MES firmware or driver scheduling issue specific to gfx1150 under long-duration compute loads. What I can provide: ------------------- =E2=80=A2 Full logs including `dmesg`, journal slices, and fence dump snaps= hots =20 =E2=80=A2 Exact reproduction steps (training runs 10=E2=80=9320 hours, alwa= ys reproduces) =20 =E2=80=A2 Willingness to test: =E2=80=93 patched kernels =E2=80=93 updated MES firmware =E2=80=93 new linux-firmware bundles =E2=80=93 debug patches or instrumentation =20 =E2=80=A2 Any additional diagnostic commands you require =20 Please let me know what additional information would be most useful. Thank you for your time, and I=E2=80=99m happy to assist further with debug= ging gfx1150 stability under sustained compute load. https://bugs.gentoo.org/967078 references this issue with attachments Harris Landgarten 516 643-1286