From: Bert Karwatzki <spasswolf@web.de>
To: Alex Deucher <alexander.deucher@amd.com>
Cc: Bert Karwatzki <spasswolf@web.de>,
linux-kernel@vger.kernel.org, amd-gfx@lists.freedesktop.org,
linux-next@vger.kernel.org, Jesse Zhang <jesse.zhang@amd.com>,
Amber Lin <Amber.Lin@amd.com>,
Mario Limonciello <mario.limonciello@amd.com>
Subject: GPU reset when running the ROCm hsa runtime tests on gfx12 and next-20260701
Date: Fri, 3 Jul 2026 14:44:03 +0200
Message-ID: <20260703124405.56248-1-spasswolf@web.de>

When running the ROCm HSA runtime tests on Debian unstable (with ROCm packages from experimental)
on linux-next next-20260701, the GPU resets with the following error messages:
$ /usr/libexec/rocm/libhsa-runtime64-tests/run-tests
[ T2548] amdgpu 0000:03:00.0: failed to suspend gangs from MES
[ T2548] amdgpu 0000:03:00.0: MES might be in unrecoverable state, issue a GPU reset
[ T360] amdgpu 0000:03:00.0: GPU reset begin!. Source: 3
[T28252] amdgpu 0000:03:00.0: Failed to evict queue 0
[T28252] amdgpu: Failed to quiesce KFD
[ T360] amdgpu 0000:03:00.0: Dumping IP State
[ T360] amdgpu 0000:03:00.0: Dumping IP State Completed
[ T360] amdgpu 0000:03:00.0: MODE1 reset
[ T360] amdgpu 0000:03:00.0: GPU mode1 reset
[ T360] amdgpu 0000:03:00.0: GPU smu mode1 reset
[ T360] amdgpu 0000:03:00.0: GPU reset succeeded, trying to resume
[drm] PCIE GART of 512M enabled (table at 0x00000083DAB00000).
[drm] AMDGPU device coredump file has been created
[drm] Check your /sys/class/drm/card0/device/devcoredump/data
[ T360] amdgpu 0000:03:00.0: VRAM is lost due to GPU reset!
[ T360] amdgpu 0000:03:00.0: PSP is resuming...
[ T360] amdgpu 0000:03:00.0: RAS: optional ras ta ucode is not available
[ T360] amdgpu 0000:03:00.0: RAP: optional rap ta ucode is not available
[ T360] amdgpu 0000:03:00.0: SECUREDISPLAY: optional securedisplay ta ucode is not available
[ T360] amdgpu 0000:03:00.0: SMU is resuming...
[ T360] amdgpu 0000:03:00.0: SMU is resumed successfully!
[ T360] amdgpu 0000:03:00.0: program CP_MES_CNTL : 0x4000000
[ T360] amdgpu 0000:03:00.0: program CP_MES_CNTL : 0xc000000
[drm] DMUB hardware initialized: version=0x00010300
[ T360] amdgpu 0000:03:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ T360] amdgpu 0000:03:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ T360] amdgpu 0000:03:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ T360] amdgpu 0000:03:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ T360] amdgpu 0000:03:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ T360] amdgpu 0000:03:00.0: ring sdma0 uses VM inv eng 9 on hub 0
[ T360] amdgpu 0000:03:00.0: ring sdma1 uses VM inv eng 10 on hub 0
[ T360] amdgpu 0000:03:00.0: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ T360] amdgpu 0000:03:00.0: ring jpeg_dec uses VM inv eng 1 on hub 8
[ T360] amdgpu 0000:03:00.0: GPU reset(1) succeeded!
[drm] device wedged, but no recovery needed
[ T2050] amdgpu 0000:03:00.0: VM memory stats for proc xfwm4(2105) task xfwm4:cs0(2050) is non-zero when fini
[ T1932] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T2293] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T1929] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T1930] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T2263] amdgpu 0000:03:00.0: VM memory stats for proc firefox-esr(3483) task firefox-es:cs0(2263) is non-zero when fini
[ T3590] amdgpu 0000:03:00.0: VM memory stats for proc RDD Process(4755) task firefox-es:cs0(3590) is non-zero when fini
[ T1469] amdgpu 0000:03:00.0: VM memory stats for proc Xorg(1495) task Xorg:cs0(1469) is non-zero when fini
As this does not occur on v7.2-rc1, I bisected the issue and found that
commit f94bbd648bb4 ("drm/amdgpu: use a single entry point for mes compute reset")
is responsible.
Unfortunately this commit does not revert cleanly on next-20260701, so I had to revert
the following commits to fix the issue there:
b789664e3e30 ("drm/amdkfd: Clean up suspend_all and resume_all mes")
a665d09b10af ("drm/amdkfd: Pass known bad queue info to reset")
a4e4d945cba8 ("drm/amdgpu/gfx: defer per-queue helper_end until after MES resume")
f401a2633e02 ("drm/amdgpu: Remove faulty queue before resume")
f94bbd648bb4 ("drm/amdgpu: use a single entry point for mes compute reset")
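
A minimal sketch of that revert sequence, assuming the commits are reverted in the order
listed above inside a next-20260701 checkout. The order is an assumption, and the helper
name revert_cmds is made up for illustration; it only prints the commands (a dry run):

```shell
#!/bin/sh
# Commit IDs taken from the list above; reverting in this order is an assumption.
COMMITS="b789664e3e30 a665d09b10af a4e4d945cba8 f401a2633e02 f94bbd648bb4"

# Hypothetical helper: print one "git revert" command per commit (dry run only).
revert_cmds() {
    for c in "$@"; do
        printf 'git revert --no-edit %s\n' "$c"
    done
}

# Print the commands; nothing is changed in the tree at this point.
revert_cmds $COMMITS
```

To actually apply the reverts, the printed commands can be run one by one in the kernel
tree, so that any conflict is noticed at the commit where it occurs.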
GPU used:
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 44 [Radeon RX 9060 XT] [1002:7590] (rev c0)
CPU used:
model name : AMD Ryzen 9 9950X 16-Core Processor
Installed ROCm packages (from Debian experimental):
bert@homer:~$ dpkg -l | grep 7.2.4
ii hip-utils 7.2.4-1~exp1 amd64 utilities for HIP language development
ii hipcc 7.2.4+dfsg-1~exp1 amd64 C++ Runtime API and Kernel Language for AMD and NVIDIA GPUs
ii libamd-comgr-dev 7.2.4+dfsg-1~exp1 amd64 ROCm code object manager - development package
ii libamd-comgr3:amd64 7.2.4+dfsg-1~exp1 amd64 ROCm code object manager
ii libamdhip64-7:amd64 7.2.4-1~exp1 amd64 HIP runtime for AMD GPUs - library
ii libamdhip64-dev 7.2.4-1~exp1 amd64 HIP runtime for AMD GPUs - headers
ii libhiprtc-builtins7:amd64 7.2.4-1~exp1 amd64 HIP runtime compilation library - builtins
ii libhiprtc7:amd64 7.2.4-1~exp1 amd64 HIP runtime compilation library
ii libhsa-runtime-dev:amd64 7.2.4+dfsg-1~exp1 amd64 HSA Runtime API and runtime for ROCm - development files
ii libhsa-runtime64-1:amd64 7.2.4+dfsg-1~exp1 amd64 HSA Runtime API and runtime for ROCm
ii libhsa-runtime64-tests 7.2.4+dfsg-1~exp1 amd64 HSA Runtime test suites for ROCm
ii librocblas-dev 7.2.4-1~exp1 amd64 ROCm library for basic linear algebra - headers
ii librocblas-doc 7.2.4-1~exp1 all ROCm library for basic linear algebra - documentation
ii librocblas5 7.2.4-1~exp1 amd64 ROCm library for basic linear algebra - library
ii librocblas5-bench 7.2.4-1~exp1 amd64 ROCm library for basic linear algebra - benchmarks
ii librocblas5-tests 7.2.4-1~exp1 amd64 ROCm library for basic linear algebra - tests
ii librocblas5-tests-data 7.2.4-1~exp1 all ROCm library for basic linear algebra - test data
ii librocm-core-dev 7.2.4-1~exp1 amd64 provides methods to get information about installed ROCm - headers
ii librocm-core1 7.2.4-1~exp1 amd64 provides methods to get information about installed ROCm
ii librocm-smi64-7 7.2.4-1~exp1 amd64 ROCm System Management Interface (ROCm SMI) library
ii librocsolver-dev 7.2.4-1~exp1 amd64 ROCm library for numerical linear algebra - headers
ii librocsolver0 7.2.4-1~exp1 amd64 ROCm library for numerical linear algebra - library
ii librocsolver0-tests 7.2.4-1~exp1 amd64 ROCm library for numerical linear algebra - tests
ii libroctx64-4 7.2.4-1~exp1 amd64 ROCm library for profiler annotations - library
ii rocm-device-libs-22 7.2.4+dfsg-1~exp1 amd64 AMD specific device-side language runtime libraries
ii rocm-opencl-icd:amd64 7.2.4-1~exp1 amd64 AMD ROCm implementation of the OpenCL API - ICD runtime
ii rocminfo 7.2.4-1~exp1 amd64 ROCm Application for Reporting System Info
Bert Karwatzki