* GPU reset when running the ROCm hsa runtime tests on gfx12 and next-20260701
From: Bert Karwatzki @ 2026-07-03 12:44 UTC
To: Alex Deucher
Cc: Bert Karwatzki, linux-kernel, amd-gfx, linux-next, Jesse Zhang,
Amber Lin, Mario Limonciello
When running the ROCm HSA runtime tests on Debian unstable (with ROCm packages from experimental)
with linux-next next-20260701, the GPU resets with the following error messages:
$ /usr/libexec/rocm/libhsa-runtime64-tests/run-tests
[ T2548] amdgpu 0000:03:00.0: failed to suspend gangs from MES
[ T2548] amdgpu 0000:03:00.0: MES might be in unrecoverable state, issue a GPU reset
[ T360] amdgpu 0000:03:00.0: GPU reset begin!. Source: 3
[T28252] amdgpu 0000:03:00.0: Failed to evict queue 0
[T28252] amdgpu: Failed to quiesce KFD
[ T360] amdgpu 0000:03:00.0: Dumping IP State
[ T360] amdgpu 0000:03:00.0: Dumping IP State Completed
[ T360] amdgpu 0000:03:00.0: MODE1 reset
[ T360] amdgpu 0000:03:00.0: GPU mode1 reset
[ T360] amdgpu 0000:03:00.0: GPU smu mode1 reset
[ T360] amdgpu 0000:03:00.0: GPU reset succeeded, trying to resume
[drm] PCIE GART of 512M enabled (table at 0x00000083DAB00000).
[drm] AMDGPU device coredump file has been created
[drm] Check your /sys/class/drm/card0/device/devcoredump/data
[ T360] amdgpu 0000:03:00.0: VRAM is lost due to GPU reset!
[ T360] amdgpu 0000:03:00.0: PSP is resuming...
[ T360] amdgpu 0000:03:00.0: RAS: optional ras ta ucode is not available
[ T360] amdgpu 0000:03:00.0: RAP: optional rap ta ucode is not available
[ T360] amdgpu 0000:03:00.0: SECUREDISPLAY: optional securedisplay ta ucode is not available
[ T360] amdgpu 0000:03:00.0: SMU is resuming...
[ T360] amdgpu 0000:03:00.0: SMU is resumed successfully!
[ T360] amdgpu 0000:03:00.0: program CP_MES_CNTL : 0x4000000
[ T360] amdgpu 0000:03:00.0: program CP_MES_CNTL : 0xc000000
[drm] DMUB hardware initialized: version=0x00010300
[ T360] amdgpu 0000:03:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ T360] amdgpu 0000:03:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ T360] amdgpu 0000:03:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ T360] amdgpu 0000:03:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ T360] amdgpu 0000:03:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ T360] amdgpu 0000:03:00.0: ring sdma0 uses VM inv eng 9 on hub 0
[ T360] amdgpu 0000:03:00.0: ring sdma1 uses VM inv eng 10 on hub 0
[ T360] amdgpu 0000:03:00.0: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ T360] amdgpu 0000:03:00.0: ring jpeg_dec uses VM inv eng 1 on hub 8
[ T360] amdgpu 0000:03:00.0: GPU reset(1) succeeded!
[drm] device wedged, but no recovery needed
[ T2050] amdgpu 0000:03:00.0: VM memory stats for proc xfwm4(2105) task xfwm4:cs0(2050) is non-zero when fini
[ T1932] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T2293] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T1929] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T1930] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T2263] amdgpu 0000:03:00.0: VM memory stats for proc firefox-esr(3483) task firefox-es:cs0(2263) is non-zero when fini
[ T3590] amdgpu 0000:03:00.0: VM memory stats for proc RDD Process(4755) task firefox-es:cs0(3590) is non-zero when fini
[ T1469] amdgpu 0000:03:00.0: VM memory stats for proc Xorg(1495) task Xorg:cs0(1469) is non-zero when fini
As this does not occur on v7.2-rc1, I bisected the issue and found
commit f94bbd648bb4 ("drm/amdgpu: use a single entry point for mes compute reset")
as responsible.
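
For anyone who wants to repeat the bisection, the following is a minimal sketch of the command sequence, not a record of what was actually run. It assumes both v7.2-rc1 and next-20260701 exist as tags in the local tree, and the helper name bisect_plan is purely illustrative. It only prints the steps, since each one needs a kernel build, a reboot and a run of the test suite in between:

```shell
#!/bin/sh
# Dry-run sketch: prints the bisect steps instead of executing them.
GOOD="v7.2-rc1"        # no GPU reset observed (from the report)
BAD="next-20260701"    # GPU reset observed (from the report)

# Hypothetical helper, named here only for illustration.
bisect_plan() {
    echo "git bisect start $BAD $GOOD"
    echo "# build and boot the checked-out kernel, then run:"
    echo "#   /usr/libexec/rocm/libhsa-runtime64-tests/run-tests"
    echo "git bisect good   # no GPU reset in dmesg"
    echo "git bisect bad    # GPU reset in dmesg"
    echo "git bisect reset  # after git names the first bad commit"
}

bisect_plan
```

The good/bad decision at each step is made by hand here, by checking dmesg for the "GPU reset begin!" message after the test run.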
Unfortunately, this commit does not revert cleanly in next-20260701, so I had to revert these
commits to fix the issue in next-20260701:
b789664e3e30 ("drm/amdkfd: Clean up suspend_all and resume_all mes")
a665d09b10af ("drm/amdkfd: Pass known bad queue info to reset")
a4e4d945cba8 ("drm/amdgpu/gfx: defer per-queue helper_end until after MES resume")
f401a2633e02 ("drm/amdgpu: Remove faulty queue before resume")
f94bbd648bb4 ("drm/amdgpu: use a single entry point for mes compute reset")
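
The reverts above could be scripted roughly as follows. This is a sketch under two assumptions: the commits are reverted in the order listed, and the abbreviated hashes resolve unambiguously in the local next-20260701 checkout. The helper name revert_all and the RUN switch are illustrative only. By default the script prints the commands, and it runs them only when RUN=1 is set:

```shell
#!/bin/sh
# Commit hashes taken from the list above, in the order listed
# (assumed, not confirmed, to be the order in which they were reverted).
COMMITS="b789664e3e30 a665d09b10af a4e4d945cba8 f401a2633e02 f94bbd648bb4"

# Hypothetical helper: dry run by default, real reverts with RUN=1.
revert_all() {
    for c in $COMMITS; do
        if [ "${RUN:-0}" = "1" ]; then
            # Stop at the first revert that does not apply cleanly.
            git revert --no-edit "$c" || return 1
        else
            echo "git revert --no-edit $c"
        fi
    done
}

revert_all
```

Running it as `RUN=1 sh revert.sh` inside the kernel tree performs the reverts, where revert.sh is whatever name the script was saved under.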
GPU used:
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 44 [Radeon RX 9060 XT] [1002:7590] (rev c0)
CPU used:
model name : AMD Ryzen 9 9950X 16-Core Processor
Installed ROCm packages (from Debian experimental):
bert@homer:~$ dpkg -l | grep 7.2.4
ii hip-utils 7.2.4-1~exp1 amd64 utilities for HIP language development
ii hipcc 7.2.4+dfsg-1~exp1 amd64 C++ Runtime API and Kernel Language for AMD and NVIDIA GPUs
ii libamd-comgr-dev 7.2.4+dfsg-1~exp1 amd64 ROCm code object manager - development package
ii libamd-comgr3:amd64 7.2.4+dfsg-1~exp1 amd64 ROCm code object manager
ii libamdhip64-7:amd64 7.2.4-1~exp1 amd64 HIP runtime for AMD GPUs - library
ii libamdhip64-dev 7.2.4-1~exp1 amd64 HIP runtime for AMD GPUs - headers
ii libhiprtc-builtins7:amd64 7.2.4-1~exp1 amd64 HIP runtime compilation library - builtins
ii libhiprtc7:amd64 7.2.4-1~exp1 amd64 HIP runtime compilation library
ii libhsa-runtime-dev:amd64 7.2.4+dfsg-1~exp1 amd64 HSA Runtime API and runtime for ROCm - development files
ii libhsa-runtime64-1:amd64 7.2.4+dfsg-1~exp1 amd64 HSA Runtime API and runtime for ROCm
ii libhsa-runtime64-tests 7.2.4+dfsg-1~exp1 amd64 HSA Runtime test suites for ROCm
ii librocblas-dev 7.2.4-1~exp1 amd64 ROCm library for basic linear algebra - headers
ii librocblas-doc 7.2.4-1~exp1 all ROCm library for basic linear algebra - documentation
ii librocblas5 7.2.4-1~exp1 amd64 ROCm library for basic linear algebra - library
ii librocblas5-bench 7.2.4-1~exp1 amd64 ROCm library for basic linear algebra - benchmarks
ii librocblas5-tests 7.2.4-1~exp1 amd64 ROCm library for basic linear algebra - tests
ii librocblas5-tests-data 7.2.4-1~exp1 all ROCm library for basic linear algebra - test data
ii librocm-core-dev 7.2.4-1~exp1 amd64 provides methods to get information about installed ROCm - headers
ii librocm-core1 7.2.4-1~exp1 amd64 provides methods to get information about installed ROCm
ii librocm-smi64-7 7.2.4-1~exp1 amd64 ROCm System Management Interface (ROCm SMI) library
ii librocsolver-dev 7.2.4-1~exp1 amd64 ROCm library for numerical linear algebra - headers
ii librocsolver0 7.2.4-1~exp1 amd64 ROCm library for numerical linear algebra - library
ii librocsolver0-tests 7.2.4-1~exp1 amd64 ROCm library for numerical linear algebra - tests
ii libroctx64-4 7.2.4-1~exp1 amd64 ROCm library for profiler annotations - library
ii rocm-device-libs-22 7.2.4+dfsg-1~exp1 amd64 AMD specific device-side language runtime libraries
ii rocm-opencl-icd:amd64 7.2.4-1~exp1 amd64 AMD ROCm implementation of the OpenCL API - ICD runtime
ii rocminfo 7.2.4-1~exp1 amd64 ROCm Application for Reporting System Info
Bert Karwatzki