The Linux Kernel Mailing List
* GPU reset when running the ROCm hsa runtime tests on gfx12 and next-20260701
@ 2026-07-03 12:44 Bert Karwatzki
From: Bert Karwatzki @ 2026-07-03 12:44 UTC
  To: Alex Deucher
  Cc: Bert Karwatzki, linux-kernel, amd-gfx, linux-next, Jesse Zhang,
	Amber Lin, Mario Limonciello

When running the ROCm HSA runtime tests on Debian unstable (with ROCm packages from experimental)
on linux-next next-20260701, the GPU resets with the following error messages:

$ /usr/libexec/rocm/libhsa-runtime64-tests/run-tests

[ T2548] amdgpu 0000:03:00.0: failed to suspend gangs from MES
[ T2548] amdgpu 0000:03:00.0: MES might be in unrecoverable state, issue a GPU reset
[  T360] amdgpu 0000:03:00.0: GPU reset begin!. Source:  3
[T28252] amdgpu 0000:03:00.0: Failed to evict queue 0
[T28252] amdgpu: Failed to quiesce KFD
[  T360] amdgpu 0000:03:00.0: Dumping IP State
[  T360] amdgpu 0000:03:00.0: Dumping IP State Completed
[  T360] amdgpu 0000:03:00.0: MODE1 reset
[  T360] amdgpu 0000:03:00.0: GPU mode1 reset
[  T360] amdgpu 0000:03:00.0: GPU smu mode1 reset
[  T360] amdgpu 0000:03:00.0: GPU reset succeeded, trying to resume
[drm] PCIE GART of 512M enabled (table at 0x00000083DAB00000).
[drm] AMDGPU device coredump file has been created
[drm] Check your /sys/class/drm/card0/device/devcoredump/data
[  T360] amdgpu 0000:03:00.0: VRAM is lost due to GPU reset!
[  T360] amdgpu 0000:03:00.0: PSP is resuming...
[  T360] amdgpu 0000:03:00.0: RAS: optional ras ta ucode is not available
[  T360] amdgpu 0000:03:00.0: RAP: optional rap ta ucode is not available
[  T360] amdgpu 0000:03:00.0: SECUREDISPLAY: optional securedisplay ta ucode is not available
[  T360] amdgpu 0000:03:00.0: SMU is resuming...
[  T360] amdgpu 0000:03:00.0: SMU is resumed successfully!
[  T360] amdgpu 0000:03:00.0: program CP_MES_CNTL : 0x4000000
[  T360] amdgpu 0000:03:00.0: program CP_MES_CNTL : 0xc000000
[drm] DMUB hardware initialized: version=0x00010300
[  T360] amdgpu 0000:03:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  T360] amdgpu 0000:03:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[  T360] amdgpu 0000:03:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[  T360] amdgpu 0000:03:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[  T360] amdgpu 0000:03:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[  T360] amdgpu 0000:03:00.0: ring sdma0 uses VM inv eng 9 on hub 0
[  T360] amdgpu 0000:03:00.0: ring sdma1 uses VM inv eng 10 on hub 0
[  T360] amdgpu 0000:03:00.0: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[  T360] amdgpu 0000:03:00.0: ring jpeg_dec uses VM inv eng 1 on hub 8
[  T360] amdgpu 0000:03:00.0: GPU reset(1) succeeded!
[drm] device wedged, but no recovery needed
[ T2050] amdgpu 0000:03:00.0: VM memory stats for proc xfwm4(2105) task xfwm4:cs0(2050) is non-zero when fini
[ T1932] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T2293] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T1929] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T1930] amdgpu 0000:03:00.0: VM memory stats for proc (0) task (0) is non-zero when fini
[ T2263] amdgpu 0000:03:00.0: VM memory stats for proc firefox-esr(3483) task firefox-es:cs0(2263) is non-zero when fini
[ T3590] amdgpu 0000:03:00.0: VM memory stats for proc RDD Process(4755) task firefox-es:cs0(3590) is non-zero when fini
[ T1469] amdgpu 0000:03:00.0: VM memory stats for proc Xorg(1495) task Xorg:cs0(1469) is non-zero when fini


As this does not occur on v7.2-rc1, I bisected the issue and found
commit f94bbd648bb4 ("drm/amdgpu: use a single entry point for mes compute reset")
to be responsible.
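
A minimal sketch of how such a bisect could be started, for anyone who wants to reproduce it. It assumes a linux-next checkout in which both v7.2-rc1 and next-20260701 exist as tags; the run helper and the DRY_RUN variable are hypothetical names used only for illustration, and the exact commands used for the original bisect are not recorded here.

```shell
#!/bin/sh
# Sketch only: assumes the current directory is a linux-next checkout
# that contains both tags named below.
set -eu

GOOD="v7.2-rc1"        # last kernel known not to show the GPU reset
BAD="next-20260701"    # kernel that shows the GPU reset

# DRY_RUN=1 (the default) only prints the command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# git bisect start takes the bad revision first, then the good one.
run git bisect start "$BAD" "$GOOD"

# After building and booting each candidate kernel, run the test suite
# and mark the result with "git bisect good" or "git bisect bad" until
# git names the first bad commit. "git bisect reset" ends the session.
```

With the default DRY_RUN=1 the script only prints the command it would run, so it is safe to try outside a kernel tree.
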
Unfortunately, that commit does not revert cleanly in next-20260701, so I had to revert these
commits to fix the issue in next-20260701:

b789664e3e30 ("drm/amdkfd: Clean up suspend_all and resume_all mes")
a665d09b10af ("drm/amdkfd: Pass known bad queue info to reset")
a4e4d945cba8 ("drm/amdgpu/gfx: defer per-queue helper_end until after MES resume")
f401a2633e02 ("drm/amdgpu: Remove faulty queue before resume")
f94bbd648bb4 ("drm/amdgpu: use a single entry point for mes compute reset")
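
The revert series above can be sketched as a single git invocation. This assumes that next-20260701 is checked out and that the commits are reverted in the order listed (presumably newest first, ending with the commit found by the bisect); the run helper and the DRY_RUN variable are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch only: assumes the current directory is a linux-next checkout
# with next-20260701 checked out and a clean working tree.
set -eu

# Commit IDs copied from the list above, in the listed order.
REVERTS="b789664e3e30 a665d09b10af a4e4d945cba8 f401a2633e02 f94bbd648bb4"

# DRY_RUN=1 (the default) only prints the command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# git revert accepts several commits and reverts them in the order
# given, creating one revert commit per entry.
# Word splitting of $REVERTS is intended here.
run git revert --no-edit $REVERTS
```

If one of the reverts conflicts, git stops at that commit; after resolving the conflict, "git revert --continue" resumes the sequence.
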

GPU used:
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 44 [Radeon RX 9060 XT] [1002:7590] (rev c0)
CPU used:
model name	: AMD Ryzen 9 9950X 16-Core Processor

Installed ROCm packages (from Debian experimental):

bert@homer:~$ dpkg -l | grep 7.2.4
ii  hip-utils                                                        7.2.4-1~exp1                             amd64        utilities for HIP language development
ii  hipcc                                                            7.2.4+dfsg-1~exp1                        amd64        C++ Runtime API and Kernel Language for AMD and NVIDIA GPUs
ii  libamd-comgr-dev                                                 7.2.4+dfsg-1~exp1                        amd64        ROCm code object manager - development package
ii  libamd-comgr3:amd64                                              7.2.4+dfsg-1~exp1                        amd64        ROCm code object manager
ii  libamdhip64-7:amd64                                              7.2.4-1~exp1                             amd64        HIP runtime for AMD GPUs - library
ii  libamdhip64-dev                                                  7.2.4-1~exp1                             amd64        HIP runtime for AMD GPUs - headers
ii  libhiprtc-builtins7:amd64                                        7.2.4-1~exp1                             amd64        HIP runtime compilation library - builtins
ii  libhiprtc7:amd64                                                 7.2.4-1~exp1                             amd64        HIP runtime compilation library
ii  libhsa-runtime-dev:amd64                                         7.2.4+dfsg-1~exp1                        amd64        HSA Runtime API and runtime for ROCm - development files
ii  libhsa-runtime64-1:amd64                                         7.2.4+dfsg-1~exp1                        amd64        HSA Runtime API and runtime for ROCm
ii  libhsa-runtime64-tests                                           7.2.4+dfsg-1~exp1                        amd64        HSA Runtime test suites for ROCm
ii  librocblas-dev                                                   7.2.4-1~exp1                             amd64        ROCm library for basic linear algebra - headers
ii  librocblas-doc                                                   7.2.4-1~exp1                             all          ROCm library for basic linear algebra - documentation
ii  librocblas5                                                      7.2.4-1~exp1                             amd64        ROCm library for basic linear algebra - library
ii  librocblas5-bench                                                7.2.4-1~exp1                             amd64        ROCm library for basic linear algebra - benchmarks
ii  librocblas5-tests                                                7.2.4-1~exp1                             amd64        ROCm library for basic linear algebra - tests
ii  librocblas5-tests-data                                           7.2.4-1~exp1                             all          ROCm library for basic linear algebra - test data
ii  librocm-core-dev                                                 7.2.4-1~exp1                             amd64        provides methods to get information about installed ROCm - headers
ii  librocm-core1                                                    7.2.4-1~exp1                             amd64        provides methods to get information about installed ROCm
ii  librocm-smi64-7                                                  7.2.4-1~exp1                             amd64        ROCm System Management Interface (ROCm SMI) library
ii  librocsolver-dev                                                 7.2.4-1~exp1                             amd64        ROCm library for numerical linear algebra - headers
ii  librocsolver0                                                    7.2.4-1~exp1                             amd64        ROCm library for numerical linear algebra - library
ii  librocsolver0-tests                                              7.2.4-1~exp1                             amd64        ROCm library for numerical linear algebra - tests
ii  libroctx64-4                                                     7.2.4-1~exp1                             amd64        ROCm library for profiler annotations - library
ii  rocm-device-libs-22                                              7.2.4+dfsg-1~exp1                        amd64        AMD specific device-side language runtime libraries
ii  rocm-opencl-icd:amd64                                            7.2.4-1~exp1                             amd64        AMD ROCm implementation of the OpenCL API - ICD runtime
ii  rocminfo                                                         7.2.4-1~exp1                             amd64        ROCm Application for Reporting System Info


Bert Karwatzki
    
