All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: remove amdgpu_mes_self_test in gpu recover
@ 2023-10-26 13:34 Yifan Zhang
  2023-10-26 14:05 ` Christian König
  0 siblings, 1 reply; 2+ messages in thread
From: Yifan Zhang @ 2023-10-26 13:34 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Tim.Huang, Yifan Zhang, christian.koenig,
	li.ma

gpu tlb flush is skipped if reset sem is held, it makes
mes_self_test fail since it involves add_hw_queue/remove_hw_queue
which needs tlb flush functional. Remove mes_self_test in gpu
recover sequence.

This patch is to fix the recover failure in gfx11.

[ 1831.768292] [drm] ring sdma_32769.3.3 was added
[ 1831.768313] [drm] ring gfx_32769.1.1 ib test pass
[ 1831.768337] [drm] ring compute_32769.2.2 ib test pass
[ 1831.768399] amdgpu 0000:c2:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process  pid 0 thread  pid 0)
[ 1831.768434] amdgpu 0000:c2:00.0: amdgpu:   in page starting at address 0x0000aec200000000 from client 10
[ 1831.768456] amdgpu 0000:c2:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00800A30
[ 1831.768473] amdgpu 0000:c2:00.0: amdgpu:      Faulty UTCL2 client ID: CPC (0x5)
[ 1831.768489] amdgpu 0000:c2:00.0: amdgpu:      MORE_FAULTS: 0x0
[ 1831.768501] amdgpu 0000:c2:00.0: amdgpu:      WALKER_ERROR: 0x0
[ 1831.768513] amdgpu 0000:c2:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[ 1831.768521] amdgpu 0000:c2:00.0: amdgpu:      MAPPING_ERROR: 0x0
[ 1831.768529] amdgpu 0000:c2:00.0: amdgpu:      RW: 0x0
[ 1831.931229] amdgpu 0000:c2:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma_32769.3.3 test failed (-110)
[ 1832.062917] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 1832.063107] [drm:amdgpu_mes_remove_hw_queue [amdgpu]] *ERROR* failed to remove hardware queue, queue id = 3

Fixes: d0c860f33553 ("drm/amdgpu: rework lock handling for flush_tlb v2")
Reported-by: Li Ma <li.ma@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7ec32b44df05..5169e55b7fd2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5557,10 +5557,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			drm_sched_start(&ring->sched, true);
 		}
 
-		if (adev->enable_mes &&
-		    amdgpu_ip_version(adev, GC_HWIP, 0) != IP_VERSION(11, 0, 3))
-			amdgpu_mes_self_test(tmp_adev);
-
 		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
 			drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2023-10-26 14:05 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-10-26 13:34 [PATCH] drm/amdgpu: remove amdgpu_mes_self_test in gpu recover Yifan Zhang
2023-10-26 14:05 ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.